Abstract

Nonparametric two sample testing is a decision theoretic problem that involves identifying differences 
between two random variables without making parametric assumptions about their underlying distributions. 
We refer to the most common settings as mean difference alternatives (MDA), for testing differences only 
in first moments, and general difference alternatives (GDA), which is about testing for any difference in
distributions. A large number of test statistics have been proposed for both these settings. This paper connects
three classes of statistics - high dimensional variants of Hotelling’s t-test, statistics based on Reproducing 
Kernel Hilbert Spaces, and energy statistics based on pairwise distances. We ask the following question - 
how much statistical power do popular kernel and distance based tests for GDA have when the unknown 
distributions differ in their means, compared to specialized tests for MDA? 

To answer this, we formally characterize the power of popular tests for GDA like the Maximum Mean 
Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance 
with the Euclidean norm (eED) in the high-dimensional MDA regime. We prove several interesting
properties relating these classes of tests under MDA, which include

(a) eED and gMMD have asymptotically equal power; furthermore they also enjoy a free lunch
because, while they are additionally consistent for GDA, they have the same power as specialized
high-dimensional t-tests for MDA. All these tests are asymptotically optimal (including matching 
constants) for MDA under spherical covariances, according to simple lower bounds. 

(b) The power of gMMD is independent of the kernel bandwidth, as long as it is larger than the choice 
made by the median heuristic. 

(c) There is a clear and smooth computation-statistics tradeoff for linear-time, subquadratic-time and 
quadratic-time versions of these tests, with more computation resulting in higher power. 
All three observations are practically important: point (a) implies that eED and gMMD, while being
consistent against all alternatives, are also automatically adaptive to simpler alternatives; point (b) suggests
that the median “heuristic” has some theoretical justification for being a default bandwidth choice; and point
(c) implies that expending more computation may yield direct statistical benefit by orders of magnitude.


1 Introduction 

Nonparametric two sample testing (or homogeneity testing) deals with detecting differences between two 
distributions, given samples from both, without making any parametric distributional assumptions. More 
formally, given samples $X_1, \dots, X_n \sim P$ and $Y_1, \dots, Y_m \sim Q$, where $P$ and $Q$ are distributions in $\mathbb{R}^d$, the
most common types of two sample tests involve testing for the following sets of null and alternate hypotheses:

General difference alternatives (GDA): $H_0 : P = Q$ vs. $H_1 : P \neq Q$,

Mean difference alternatives (MDA): $H_0 : \mu_P = \mu_Q$ vs. $H_1 : \mu_P \neq \mu_Q$,

where $\mu_P := E_P X$ and $\mu_Q := E_Q Y$. This problem has a sustained interest in both the statistics and machine
learning literature, due to applications where the sample size might be limited compared to dimensionality, 
due to experimental or computational costs. For example, it can be used to answer questions in medicine 
(is there a difference between pill and placebo?) and neuroscience (does a particular brain region respond 
differently to two different kinds of stimuli?). 

We will assume $m = n$ for simplicity, though our results may be extended to the case when $m/(n + m)$
converges to any constant $\kappa \in (0,1)$. A test $\eta$ is a function from $X_1, \dots, X_n, Y_1, \dots, Y_n$ to $\{0,1\}$, where we
reject $H_0$ when $\eta = 1$. We will only consider tests that have an asymptotic type-I error of at most $\alpha$. Let us
denote the set of all such tests by

$[\eta]_{n,d,\alpha} := \{\eta : (\mathbb{R}^d)^{2n} \to \{0,1\},\ E_{H_0}\eta \le \alpha + o(1)\}.$ (1)

In the Neyman-Pearson paradigm for the fixed $d$ setting, a test is judged by its power $\phi = \phi(n, P, Q, \alpha) =
E_{H_1}\eta$, and we say that such a test $\eta \in [\eta]_{n,d,\alpha}$ is consistent in the fixed $d$ setting when

$E_{H_1}\eta \to 1,\ E_{H_0}\eta \le \alpha$ as $n \to \infty$ for any fixed $\alpha > 0$.

In contrast, we say that a test $\eta \in [\eta]_{n,d,\alpha}$ is consistent in the high-dimensional setting when its power
$\phi = \phi(n, d_n, P_n, Q_n, \alpha) = E_{H_1}\eta$ satisfies

$\phi \to 1,\ E_{H_0}\eta \le \alpha$ as $(n, d) \to \infty$, for any fixed $\alpha > 0$,

where one also needs to specify the relative rate at which n, d can increase. The central question being 
considered in this paper is “what is the power of tests designed for GDA, compared to those designed for 
MDA, when the distributions truly differ in their means?”. We will explain this and other related questions in 
more detail in Section 3.

Remark 1. The tests considered in this paper have some common properties. All the test statistics $T$ are
centered under the null, i.e. $E_{H_0}T = 0$, and dividing the statistic by $\sqrt{\mathrm{var}(T)}$ leads to an asymptotically standard
normal statistic under the null, i.e. $T/\sqrt{\mathrm{var}(T)} \rightsquigarrow N(0,1)$ under $H_0$, where $\rightsquigarrow$ represents convergence in
distribution as $n \to \infty$. Hence all tests are of the form

$\eta(X_1, \dots, X_n, Y_1, \dots, Y_n) = I\left( T/\sqrt{\widehat{\mathrm{var}}(T)} > z_\alpha \right),$

where $z_\alpha$ is the $1 - \alpha$ quantile of the standard normal distribution.
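
As an illustration of the decision rule in Remark 1, the following minimal Python sketch thresholds a studentized statistic at the $1 - \alpha$ normal quantile. The function names `z_alpha` and `reject_null` are hypothetical, and the statistic and its variance estimate are assumed to be supplied by the caller.

```python
from statistics import NormalDist


def z_alpha(alpha):
    """Return the (1 - alpha) quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - alpha)


def reject_null(t_stat, var_estimate, alpha=0.05):
    """Return 1 (reject H0) if T / sqrt(var_estimate) exceeds z_alpha, else 0.

    Both arguments are assumed to come from the caller, e.g. a U-statistic
    that is centered under the null and an estimate of its null variance.
    """
    if var_estimate <= 0:
        raise ValueError("variance estimate must be positive")
    return int(t_stat / var_estimate ** 0.5 > z_alpha(alpha))
```

For example, `reject_null(2.0, 1.0)` rejects at level 0.05 because 2.0 exceeds the threshold of roughly 1.645, whereas `reject_null(1.0, 1.0)` does not.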
Two-sample testing is a fundamental decision-theoretic problem, having a long history in statistics - for
example, the past century has seen a wide adoption of the t-statistic by Hotelling (1931) to decide if two
samples have different population means (MDA). It was introduced in the parametric setting for univariate
Gaussians, but it has been generalized to multivariate non-Gaussian settings as well. If $\bar{X}, \bar{Y}$ are the sample
means, and $S$ is a joint sample covariance matrix, then a statistician using the multivariate t-test calculates

$T_H := (\bar{X} - \bar{Y})^T S^{-1} (\bar{X} - \bar{Y})$

and the test is $I(T_H/\sqrt{\mathrm{var}(T_H)} > t_\alpha)$, where $t_\alpha$ is chosen so that the type-I error is at most $\alpha + o(1)$. $T_H$ is consistent
for MDA whenever $P, Q$ have different means, and further, it is known to be the “uniformly most powerful”
test when $P, Q$ are univariate Gaussians under fairly general assumptions (Kariya, 1981; Simaika, 1941;
Anderson, 1958; Salaevskii, 1971).

In a seminal paper, Bai and Saranadasa (1996) proved that $T_H$ has asymptotic power
tending to $\alpha$ in the high-dimensional setting (as discussed in the next section), motivating the study of
alternative test statistics. Despite their increasing popularity and usage, many interesting questions remain
unanswered, as will be discussed in Section 3 and partially answered in this paper. This paper deals with
(moderately) high-dimensional and nonparametric two-sample testing, where $d$ can grow polynomially with
$n$, and there are no explicit parametric assumptions on $P, Q$. In our experiments, we validate our
claims for a variety of distributions, even at quite small sample sizes and dimensions. This shows that the
asymptotics accurately describe even finite sample behavior of these tests.


Paper Outline. The rest of this paper is organized as follows. In Section 2, we introduce three classes of
tests in the literature - Hotelling-based tests for MDA, and kernel-based and distance-based tests for GDA,
and we discuss related open questions in Section 3. In Section 4, we prove that three of the most popular tests
(one from each class) have the same asymptotic power for MDA, showing the free adaptivity of GDA-based
tests for the simpler MDA problem. We then show that all these classes of tests are optimal for MDA
under the diagonal covariance setting, by adapting a lower bound from the normal means problem. Next,
we discuss computation-statistics tradeoffs, where we compare the power of linear-time, sub-quadratic time
and quadratic-time versions of these tests. Finally, we run experiments and discuss some practical
implications of this work, and we end with the proofs.

Notation. We use the standard $o, o_P, O_P$ notation extensively. Also, for two non-random sequences $A_n, B_n$,
$A_n = \Omega(B_n)$ is the negation of $A_n = o(B_n)$, $A_n = \omega(B_n)$ is the negation of $A_n = O(B_n)$, and we write $A_n \asymp B_n$
to mean $A_n = B_n(c + o(1))$ for some absolute constant $c$. $Tr(\cdot)$ is the trace of a (square) matrix and $Tr^k(\cdot)$
is the $k$-th power of the trace, $\circ$ is the elementwise or Hadamard product, $Ts(\cdot)$ refers to the total sum of all
the elements of a matrix, $e_i$ is the $i$-th standard basis vector, $\mathbf{1}$ is the vector of ones, $\rightsquigarrow$ is convergence in
distribution, and $I(\cdot)$ is a 0-1 indicator function.


2 Hotelling-based MDA Tests and Kernel/Distance-based GDA tests 


Tests for MDA. As mentioned in the introduction, Bai and Saranadasa (1996) prove that Hotelling’s $T_H$ has
power tending to $\alpha$ (this is called trivial power) when $(n, d) \to \infty$ with $d/n \to 1 - \epsilon$ for small $\epsilon$, explained
by the inherent difficulty of accurately estimating the $O(d^2)$ parameters of $\Sigma^{-1}$ with very few samples ($S^{-1}$
is not even defined if $d > n$ and is badly conditioned if $d$ is of similar order as $n$). To avoid this problem,
they proposed to use the test statistic

$T_{BS} := \|\bar{X} - \bar{Y}\|^2 - Tr(S)/n$

and showed that it has non-trivial power whenever $d/n \to c \in (0, \infty)$. An important precursor to this
nonparametric work of Bai and Saranadasa (1996) is that of Dempster (1958), who proposed a high-dimensional
t-test for Gaussians.

Srivastava and Du (2008) and Srivastava et al. (2013) proposed to use $\mathrm{diag}(S)^{-1}$
instead of $S^{-1}$ in $T_H$, and showed the advantages of the resulting statistic $T_{SD}$ in certain settings over $T_{BS}$
(specifically its scale invariance, i.e. invariance when the data is rescaled by a diagonal matrix, gives it an
advantage when the covariance matrices are diagonal but non-spherical).

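In another extension of $T_{BS}$ by Chen and Qin (2010), henceforth called CQ, the authors proposed a
variant of $T_{BS}$ of the form

$T_{CQ} := \frac{1}{n(n-1)} \sum_{i \neq j}^{n} X_i^T X_j + \frac{1}{n(n-1)} \sum_{i \neq j}^{n} Y_i^T Y_j - \frac{2}{n^2} \sum_{i,j=1}^{n} X_i^T Y_j,$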

analyzing its power for MDA when the covariances of $X, Y$ are also unequal, without explicit restrictions
on $d, n$, but rather under conditions stated in terms of $n$, $\Sigma$ and the mean difference $\delta := \mu_P - \mu_Q$. We will
return to these conditions later in this paper, since we will use assumptions of similar flavor.

Note that $E[T_{CQ}] = \mu_P^T\mu_P + \mu_Q^T\mu_Q - 2\mu_P^T\mu_Q = \|\mu_P - \mu_Q\|^2$, and hence $T_{CQ}$ is an unbiased estimator
of $\|\mu_P - \mu_Q\|^2$. In this paper, instead of using $T_{CQ}$ directly, we will analyze a minor variant, which is a
U-statistic:

$U_{CQ} := \frac{1}{n(n-1)} \sum_{i \neq j = 1}^{n} h_{CQ}(X_i, X_j, Y_i, Y_j),$ where $h_{CQ}(X, X', Y, Y') := X^T X' + Y^T Y' - X^T Y' - X'^T Y.$ (2)

$T_{CQ}$'s difference from $U_{CQ}$ is only in the third term, and this difference is asymptotically vanishing, so
the asymptotic properties of $U_{CQ}$ (especially its power) are identical to those of $T_{CQ}$; we use $U_{CQ}$ only for technical
convenience.
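
The U-statistic in Eq. (2) can be sketched directly from its definition. The pure-Python code below is only an illustration, quadratic in $n$ and not optimized; the helper `sample`, which draws Gaussian data with a constant mean vector, is a hypothetical addition used for demonstration.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(u * v for u, v in zip(a, b))


def h_cq(x, xp, y, yp):
    """Kernel h_CQ(X, X', Y, Y') = X.X' + Y.Y' - X.Y' - X'.Y."""
    return dot(x, xp) + dot(y, yp) - dot(x, yp) - dot(xp, y)


def u_cq(X, Y):
    """U-statistic U_CQ: average of h_CQ over all ordered pairs i != j."""
    n = len(X)
    if len(Y) != n or n < 2:
        raise ValueError("equal sample sizes n >= 2 are assumed")
    total = sum(h_cq(X[i], X[j], Y[i], Y[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def sample(n, d, mean, rng):
    """Hypothetical helper: n draws from N(mean * 1, I_d)."""
    return [[rng.gauss(mean, 1.0) for _ in range(d)] for _ in range(n)]
```

With `rng = random.Random(0)`, `X = sample(50, 20, 0.5, rng)` and `Y = sample(50, 20, 0.0, rng)`, the value `u_cq(X, Y)` should fluctuate around $\|\mu_P - \mu_Q\|^2 = 20 \cdot 0.25 = 5$, in line with the unbiasedness noted above.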

There is also a large literature on the so-called parametric Behrens-Fisher problem, which is a parametric 
MDA problem where the distributions are Gaussian and heteroskedastic, and also the nonparametric Behrens- 
Fisher problem that deals with MDA when P, Q are nonparametric mean-scale families, in the univariate and 
multivariate settings. See Belloni and Didier (2008) and Lopes et al. (2011) for recent such works, and
references therein. Another related line of work analyzes the setting where $d$ could be exponentially larger
than $n$ but assuming some kind of sparsity (say in the mean difference); see Cai et al. (2014) for such an
example.

Tests for GDA. It is well known that the Kolmogorov-Smirnov (KS) test by Kolmogorov (1933) and
Smirnov (1948) involves differences in empirical CDFs. The KS test, the related Cramer von-Mises criterion
by Cramer (1928) and Von Mises (1928), and the Anderson-Darling test by Anderson and Darling (1952) are
very popular in one dimension, but their usage has been more restricted in higher dimensions. This is mostly
due to the curse of dimensionality involved with estimating multivariate empirical CDFs. While there has
been work on generalizing these popular one-dimensional tests to higher dimensions, like Bickel (1969), these
are seemingly not the most common multivariate tests. Some other examples of univariate tests include rank
based tests as covered by the book Lehmann and D’Abrera (2006) and the runs test by Wald and Wolfowitz,
while some interesting multivariate tests include spanning tree methods by Friedman and Rafsky,
nearest-neighbor based tests by Schilling (1986) and Henze (1988), and the “cross-match” tests by
Rosenbaum (2005). Most of these have been proved to be consistent in the fixed $d$ setting, but not much is
known about their power in the high-dimensional setting.

One popular class of tests for the multivariate GDA problem that has emerged over the last decade is
kernel-based tests, introduced in parallel by Fernandez et al. (2008) and Gretton et al. (2006), and expanded
on in Gretton et al. (2012a). The Maximum Mean Discrepancy between $P, Q$ is defined as

$\mathrm{MMD}(\mathcal{H}_k, P, Q) := \max_{\|f\|_{\mathcal{H}_k} \le 1} E_P f(x) - E_Q f(y),$

where $\mathcal{H}_k$ is a Reproducing Kernel Hilbert Space associated with Mercer kernel $k(\cdot, \cdot)$, and $\{f : \|f\|_{\mathcal{H}_k} \le 1\}$
is its unit norm ball. It is easy to see that $\mathrm{MMD} \ge 0$, and also that $P = Q$ implies $\mathrm{MMD} = 0$. For the
converse, Gretton et al. (2006) show that under fairly general conditions involving $\mathcal{H}_k$ or equivalently $k$, the
equality holds iff $P = Q$. The authors prove that

$\mathrm{MMD}(\mathcal{H}_k, P, Q) = \|E_P k(x, \cdot) - E_Q k(y, \cdot)\|_{\mathcal{H}_k}.$


This gives rise to a natural associated test, which involves thresholding the following U-statistic, an unbiased
estimator of $\mathrm{MMD}^2$:

$\mathrm{MMD}_u^2(k(\cdot,\cdot)) := \frac{1}{n(n-1)} \sum_{i \neq j}^{n} h_k(X_i, X_j, Y_i, Y_j),$ where $h_k(X, X', Y, Y') := k(X, X') + k(Y, Y') - k(X, Y') - k(X', Y).$ (3)


Note once again that we can form a gMMD statistic having 3 summations like $T_{CQ}$, but for technical
convenience we mimic the form of the U-statistic $U_{CQ}$, the asymptotic properties of both being the same. Note
that $U_{CQ}$ is just $\mathrm{MMD}_u^2$ when we use the linear kernel $k(a, b) = a^T b$. The most popular kernel for GDA is
the Gaussian kernel with bandwidth parameter $\gamma$, leading to the test statistic that we henceforth call gMMD:

$\mathrm{gMMD}_\gamma^2 := \mathrm{MMD}_u^2(g_\gamma(\cdot,\cdot)),$ where $g_\gamma(a, b) := \exp\left(-\|a - b\|^2/\gamma^2\right).$
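
A minimal Python sketch of the U-statistic in Eq. (3) for a generic kernel follows. It assumes the Gaussian kernel is normalized as $\exp(-\|a - b\|^2/\gamma^2)$; the exact normalization of the bandwidth is an assumption of this sketch, and all function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def linear_kernel(a, b):
    """Linear kernel k(a, b) = a.b, for which MMD_u^2 reduces to U_CQ."""
    return sum(u * v for u, v in zip(a, b))


def gaussian_kernel(a, b, gamma):
    """Gaussian kernel; the normalization exp(-||a-b||^2 / gamma^2) is assumed."""
    return math.exp(-sq_dist(a, b) / gamma ** 2)


def mmd2_u(X, Y, kernel):
    """Unbiased U-statistic estimate of MMD^2 over ordered pairs i != j."""
    n = len(X)
    if len(Y) != n or n < 2:
        raise ValueError("equal sample sizes n >= 2 are assumed")
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (kernel(X[i], X[j]) + kernel(Y[i], Y[j])
                          - kernel(X[i], Y[j]) - kernel(X[j], Y[i]))
    return total / (n * (n - 1))


def gmmd2_u(X, Y, gamma):
    """gMMD^2 statistic: MMD_u^2 with the Gaussian kernel of bandwidth gamma."""
    return mmd2_u(X, Y, lambda a, b: gaussian_kernel(a, b, gamma))
```

Passing `linear_kernel` to `mmd2_u` recovers $U_{CQ}$, which makes the relationship between the two statistics easy to check on small inputs.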




Apart from the fact that the population $\mathrm{gMMD}^2(P, Q) = 0$ iff $P = Q$, the other fact that makes this a useful
test statistic is that its estimation error, i.e. the error of $\mathrm{MMD}_u^2$ in estimating $\mathrm{MMD}^2$, scales like $1/\sqrt{n}$,
independent of $d$; see Gretton et al. (2012a) for a detailed proof of this fact. This is unlike the KL divergence,
for example, which is 0 iff $P = Q$ but is hard to estimate in high dimensions. However, it was recently
argued in Ramdas et al. (2015) that the study of estimation error covers only one side of the story, and that
test power still degrades with $d$ even if estimation error does not.

A related but different class of tests are distance-based “energy statistics”, introduced in parallel by
Baringhaus and Franz (2004) and Szekely and Rizzo (2004), and generalized to some kinds of metrics,
denoted $\rho$, for a related independence testing problem, by Lyons (2013). The test statistic is called the
Cramer statistic by the former paper but we use the term Energy Distance as done by the latter, and once
more, we study the U-statistic form:

$\mathrm{ED}_u(\rho(\cdot,\cdot)) := \frac{1}{n(n-1)} \sum_{i \neq j}^{n} h_\rho(X_i, X_j, Y_i, Y_j),$ where $h_\rho(X, X', Y, Y') := \rho(X, Y') + \rho(X', Y) - \rho(X, X') - \rho(Y, Y').$ (4)


The most popular or “default” choice within this class (the only one studied by both sets of authors who
introduced it) is the Energy Distance with the Euclidean distance, henceforth called eED, defined as

$\mathrm{eED}_u := \mathrm{ED}_u(e(\cdot,\cdot)),$ where $e(a, b) := \|a - b\|_2.$
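
The energy-distance U-statistic of Eq. (4) differs from the kernel statistic only in the sign pattern of its summands. The following Python sketch, with hypothetical function names, computes it for a generic distance and defaults to the Euclidean distance used by eED.

```python
import math


def euclid(a, b):
    """Euclidean distance e(a, b) = ||a - b||_2."""
    return math.dist(a, b)


def ed_u(X, Y, rho=euclid):
    """U-statistic form of the Energy Distance for a generic distance rho."""
    n = len(X)
    if len(Y) != n or n < 2:
        raise ValueError("equal sample sizes n >= 2 are assumed")
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # Between-sample distances enter positively,
                # within-sample distances negatively.
                total += (rho(X[i], Y[j]) + rho(X[j], Y[i])
                          - rho(X[i], X[j]) - rho(Y[i], Y[j]))
    return total / (n * (n - 1))
```

For two point masses at 0 and 3 on the real line, the statistic equals $2 \cdot 3 - 0 - 0 = 6$, which `ed_u([[0.0], [0.0]], [[3.0], [3.0]])` reproduces.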

Appropriately thresholding $\mathrm{gMMD}_\gamma^2$ and $\mathrm{eED}_u$ leads to tests that are consistent for GDA in the fixed $d$
setting against all fixed alternatives where $P \neq Q$ (and some local alternatives, i.e. alternatives that change
with $n$) under fairly general conditions; such results can be found in the associated references. However,
not much is known about them in the high dimensional regime.

Remark 2. This paper will deal largely with gMMD and eED, because these are the most popular choices
for kernel and distance used in practice, but similar inferences can possibly be made about other kernels and
distances, using the same proof technique. Similarly, we will focus on $U_{CQ}$, though one may draw similar
inferences about $T_{BS}$ and $T_{SD}$ and their corresponding GDA variants.
3 Open Questions and Summary of Results

The test statistics for MDA, like $U_{CQ}$, $T_{BS}$, $T_{SD}$, $T_H$, have all been analysed in the high-dimensional setting.
However, there is presently poor understanding of gMMD and eED in high dimensions. Below we list some 
of these open questions (along with explanations) that we are going to answer in this paper, followed by our 
partial answers to these questions. 


Q1. How can one characterize the power of nonparametric tests like gMMD and/or eED in high dimensions,
either for GDA or MDA?


Explanation [Q1]. In the fixed $d$ setting, gMMD and eED are well understood, and their null and alternate
distributions are given in Gretton et al. (2012a) and Szekely and Rizzo (2004) respectively. However,
their behavior in high dimensions seems to be essentially unanswered in the current literature. A general
characterization of power is impossible since $P, Q$ could be different yet arbitrarily similar to each other (see
Section 3.2 of Gretton et al. (2012a) for a formal statement and proof of this claim). For this reason, one
is somewhat restricted to trying to characterize the power in limited settings. For example, one can hope to
characterize the power by parameterizing the problem in terms of the smallest moment in which $P, Q$ differ.

Result [Q1]. One way that we propose to analyze them is to consider two nonparametric distributions
$P, Q$ that only differ in one specific moment and see how much power gMMD or eED have to identify this
difference and reject the null. As a first step, this paper will characterize their power for MDA, when $P, Q$
differ only in their first moment.


Q2. How does the choice of bandwidth parameter $\gamma$ affect the power of $\mathrm{gMMD}_\gamma^2$, for GDA or MDA?


Explanation [Q2]. The most popular choice of bandwidth is the “median heuristic”, where it is chosen
as the median Euclidean distance between all pairs of points (see Scholkopf and Smola, 2002). However,
the effect of this choice on test power is unclear. Gretton et al. (2012b) also make suggestions for choosing
the bandwidth parameter, but only for the linear-time $\mathrm{gMMD}_l$ (discussed in a later section), and with guarantees only
in the fixed $d$ setting. Hence the study of how the kernel bandwidth affects power is a work in progress in
the current literature. For any fixed $\gamma$, consistency for GDA was proved in Gretton et al. (2006); further,
the power of $\mathrm{gMMD}_\gamma^2$ against any fixed GDA alternative was also explicitly derived in the fixed $d$ setting,
ignoring constants, in terms of the Gaussian CDF $\Phi$. Notice that consistency of the gMMD test
for any fixed $\gamma$ is in stark contrast to using Gaussian kernels for density estimation, where we must let the
bandwidth go to zero with increasing $n$, and hence the gMMD statistic does not behave in the same way as
the $L_2$-distance between kernel density estimates, as done in Anderson et al. (1994).

Result [Q2]. In Section 4, we prove that the power of $\mathrm{gMMD}_\gamma^2$ does not depend on the bandwidth
parameter $\gamma$, as long as $\gamma$ is chosen to be asymptotically larger than the choice made by the aforementioned
median heuristic.
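
To make the median heuristic concrete, the Python sketch below computes the median pairwise Euclidean distance of a point set. Applying it to the pooled sample is an assumption of this sketch, and `gaussian_cloud` is a hypothetical helper for generating test data.

```python
import math
import random
import statistics


def median_heuristic(points):
    """Median Euclidean distance over all unordered pairs of points."""
    dists = [math.dist(p, q)
             for i, p in enumerate(points) for q in points[i + 1:]]
    return statistics.median(dists)


def gaussian_cloud(n, d, seed=0):
    """Hypothetical helper: n points drawn from N(0, I_d)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
```

Since $E\|X - X'\|^2 = 2Tr(\Sigma)$ for independent draws with covariance $\Sigma$, the squared output on `gaussian_cloud(40, 100)` should be close to $2d = 200$, which indicates the scale of $\gamma^2$ that the median heuristic selects.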


Q3. Can one directly compare the power of eED and gMMD for GDA or MDA? Is one of them more
powerful than the other?


Explanation [Q3]. Sejdinovic et al. (2013) describe connections between kernel and distance based tests
for independence testing. Informally speaking, there is a near one-to-one correspondence between the class
of kernels and distances for which such tests make sense. However, while there is some metric/semimetric
that corresponds to the Gaussian kernel $g$, that metric/semimetric is not the Euclidean distance $e$ (and vice
versa). eED seems to be more popular in the statistics literature, and gMMD in machine learning - it is of
practical importance to both fields to know how one should choose between eED and gMMD.

Result [Q3]. In Section 4, we show that (under fairly general conditions) gMMD and eED have asymptotically
equal power for MDA, both in theory and practice.
Q4. How do the powers of tests for GDA compare to tests for MDA, when (unknown to us) $P, Q$ actually
differ in only their means?


Explanation [Q4]. Given a nonparametric two-sample testing problem, one generally does not know
if the distributions differ in their means or not. If they do differ in their means, presumably tests for GDA
may perform worse than tests for MDA, since the latter are designed specifically for that purpose, and can
concentrate all their power in detecting first moment differences. But how much worse? What is the price one
must pay for the extra generality of gMMD and eED? One of the main questions considered in this paper is
thus one of comparing the powers of eED, gMMD and $U_{CQ}$.

Result [Q4]. In Section 4, we prove that one does not pay any price for the generality of $\mathrm{gMMD}_\gamma^2$, $\mathrm{eED}_u$
(they enjoy a “free lunch”) - $\mathrm{gMMD}_\gamma^2$ and $\mathrm{eED}_u$ have the same power as $U_{CQ}$ against MDA in high
dimensions, both in theory and practice, even though $\mathrm{gMMD}_\gamma^2$ and $\mathrm{eED}_u$ are also consistent against GDA whereas
$U_{CQ}$ is not. We would like to note that this result has actually been observed in practice, but seemingly
not been explicitly acknowledged or conjectured. Figures 1 and 4 of Baringhaus and Franz (2004) are quite
convincing for eED, and the authors explicitly point this out in their experiments and conclusion sections,
while Figures 3 and 4 of Lopes et al. (2011) also show the same phenomenon for gMMD, though the latter
authors do not comment on their experimental observation. As far as we know, this paper has the first rigorous
justification of such a phenomenon.


Q5. How does computation affect power in high dimensions?


Explanation [Q5]. A final question we consider is the relationship between computation and power.
Noting that $\mathrm{gMMD}_u$ takes quadratic time, i.e. $O(n^2)$, to compute, Gretton et al. (2012a) and Zaremba et al.
(2013) introduce linear-time and block-based subquadratic-time statistics $\mathrm{gMMD}_l$ and $\mathrm{gMMD}_b$. The main
related work in this regard is Reddi et al. (2015), which analyses the linear-time version $\mathrm{gMMD}_l$ in the
high-dimensional setting. We will discuss this last question in detail in a later section.

Result [Q5]. We show that expending more computation yields a direct statistical benefit
of higher power; there is a clear and smooth statistics-computation tradeoff for a family of earlier proposed
sub-quadratic and linear time (kernel) two sample tests.
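
One way to picture this tradeoff is through a block-averaged estimator, sketched below in Python under the assumption that the block statistic averages within-block U-statistics over disjoint consecutive blocks. The function names are hypothetical. A block size of 2 gives a linear-time statistic, a block size of $n$ gives the full quadratic-time U-statistic, and intermediate block sizes $B$ cost on the order of $nB$ kernel evaluations.

```python
def linear_kernel(a, b):
    """Linear kernel, used here only to keep the example easy to verify."""
    return sum(u * v for u, v in zip(a, b))


def h(kernel, x, xp, y, yp):
    """Summand h_k(X, X', Y, Y') of the MMD U-statistic."""
    return kernel(x, xp) + kernel(y, yp) - kernel(x, yp) - kernel(xp, y)


def mmd2_block(X, Y, kernel, block_size):
    """Average of within-block U-statistics over disjoint consecutive blocks.

    Trailing points that do not fill a complete block are dropped.
    """
    n = len(X)
    if len(Y) != n or not 2 <= block_size <= n:
        raise ValueError("need equal sample sizes and 2 <= block_size <= n")
    estimates = []
    for start in range(0, n - block_size + 1, block_size):
        idx = range(start, start + block_size)
        total = sum(h(kernel, X[i], X[j], Y[i], Y[j])
                    for i in idx for j in idx if i != j)
        estimates.append(total / (block_size * (block_size - 1)))
    return sum(estimates) / len(estimates)
```

Smaller blocks discard most cross-pair information, which is the intuition behind the loss of power that accompanies cheaper computation.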


Q6. What are the lower bounds for two sample testing in high dimensions? 

Explanation [Q6]. We have not seen any lower bounds for the two sample testing problem in the literature, 
and definitely none for the high dimensional setting, even under MDA. 

Result [Q6]. We prove tight lower bounds for two-sample testing under MDA, for the case
of diagonal covariance, which show that all three tests are optimal in this setting, even including constants.


4 Adaptivity of gMMD and eED to MDA 


This section will aim to provide some answers to questions Q1-Q4. Our main assumptions are inspired by
those in Bai and Saranadasa (1996) and Chen and Qin (2010), and related followup papers.


[A1] Model. $X_i = \Gamma Z_{1i} + \mu_P$ and $Y_i = \Gamma Z_{2i} + \mu_Q$ for $i = 1, \dots, n$, where $Z_{1i}, Z_{2i}$ are $D$-dimensional
independent zero mean, identity covariance random variables and $\Gamma$ is a $d \times D$ unknown full-rank deterministic
transformation matrix for some $D \ge d$ satisfying $\Gamma\Gamma^T = \Sigma$ (hence the $d \times d$ population covariance $\Sigma$ is
full-rank). Denote the mean difference as $\delta := \mu_P - \mu_Q$.
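
For intuition, data from the model in [A1] can be simulated as in the Python sketch below. The function `sample_model` and its `draw` argument are hypothetical; by default the coordinates of $Z$ are standard normal, but any zero-mean, unit-variance generator can be supplied.

```python
import random


def sample_model(n, gamma_mat, mu, rng, draw=None):
    """Draw n points X_i = Gamma Z_i + mu.

    gamma_mat is a d x D matrix given as a list of rows, mu is a length-d
    list, and draw() returns one zero-mean, unit-variance coordinate of Z
    (standard normal when not supplied).
    """
    d, D = len(gamma_mat), len(gamma_mat[0])
    if draw is None:
        draw = lambda: rng.gauss(0.0, 1.0)
    points = []
    for _ in range(n):
        z = [draw() for _ in range(D)]
        points.append([sum(gamma_mat[r][c] * z[c] for c in range(D)) + mu[r]
                       for r in range(d)])
    return points
```

A non-Gaussian example is `draw=lambda: rng.uniform(-3 ** 0.5, 3 ** 0.5)`, which has mean zero and unit variance, so the same $\Gamma$ yields the same covariance $\Sigma = \Gamma\Gamma^T$.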


Remark 3. Assumption [A1] implies that $X, Y$ have means $\mu_P, \mu_Q$ and common covariance $\Sigma$, like in Bai and
Saranadasa (1996). We do not assume that $X, Y$ have different covariances $\Sigma_1, \Sigma_2$ like in Chen and Qin
(2010). The reason for this choice is as follows. gMMD and eED can detect differences in distributions
$P, Q$ that occur in any finite moment. For example, by Bochner’s theorem (see Rudin, 1962), the population
quantity $\mathrm{gMMD}^2$ is precisely (up to constants)

$\int_{\mathbb{R}^d} |\varphi_X(t) - \varphi_Y(t)|^2 e^{-\gamma^2 \|t\|^2} \, dt,$

where $\varphi_X(t) = E_{x \sim P}[e^{-i t^T x}]$ is the characteristic function of $X$ at frequency $t$ (similarly $\varphi_Y(t)$), and the
population eED is precisely (up to constants)

$\int_{(a,t) \in S^{d-1} \times \mathbb{R}} [F_X(a, t) - F_Y(a, t)]^2 \, da \, dt,$

where $F_X(a, t) = P(a^T X \le t)$ (similarly $F_Y(a, t)$) is the population CDF of $X$ when projected along
direction $a$ and $S^{d-1}$ is the surface of the $d$ dimensional unit sphere; see Szekely and Rizzo (2004) for a
proof. Because of this, gMMD and eED are sensitive to differences in second (and higher) moments of
distributions. To analyze their power against MDA, it makes sense to nullify all other sources of signal like
$\|\Sigma_1 - \Sigma_2\|_F^2$ that might alter the power of gMMD or eED.


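[A2] Moment assumption. Each of the $D$ coordinates of $Z_{1i}$ and $Z_{2i}$ has $m \ge 8$ moments, each
moment being a finite constant. For all $j = 1, \dots, n$ and $s = 1, 2$, we have

$E(Z_{sj1}^{\alpha_1} Z_{sj2}^{\alpha_2} \cdots Z_{sjD}^{\alpha_D}) = E(Z_{sj1}^{\alpha_1}) E(Z_{sj2}^{\alpha_2}) \cdots E(Z_{sjD}^{\alpha_D})$ for all $\sum_{i=1}^{D} \alpha_i \le 8$.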


Remark 4. Assumption [A2] was made in essentially the same form in Bai and Saranadasa (1996) and Chen
and Qin (2010). Some of our calculations explicitly involve how much these moments deviate from those of
a standard Gaussian. We show in our experiments that many of our results hold experimentally for a variety of
non-Gaussian distributions.


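[A3] Fairly good conditioning of $\Sigma$. (a) We assume that $Tr(\Sigma^{2k}) = o(Tr^2(\Sigma^k))$ for $k = 1, 2$. (b) We
also assume that $Tr(\Sigma) \asymp d$ and, for $S_i \in \{X_i, Y_i\}$, the average $\|S_i - S_j\|^2/d$ exponentially concentrates
around its expectation, i.e.

$P\left( \left| \frac{\|S_i - S_j\|^2}{d} - \frac{E\|S_i - S_j\|^2}{d} \right| > d^{-\nu} \right) \to 0$ exponentially fast in (some polynomial of) $d$,

for some $\nu = \nu(\Sigma, m) \in (1/3, 1/2]$.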


Remark 5. Assumption [A3] essentially means that $\Sigma$ is fairly well conditioned, and was also made in the
aforementioned earlier works. To see this, note that if $\Sigma = \sigma^2 I$ then the conditions reduce to requiring
$d = o(d^2)$. If all the eigenvalues of $\Sigma$ are bounded, this assumption is still met. When $\Sigma$'s eigenvalues are
not bounded, this condition will be satisfied as long as $\Sigma$ is not terribly conditioned. This assumption is
discussed in detail with several nontrivial examples in Chen and Qin (2010). Similarly, $\nu(\Sigma, m)$ reflects the
conditioning of $\Sigma$, and the number $m$ of moments of $S$. In the best case, with $d$ independent coordinates,
i.e. identity covariance $\Sigma = I$ and infinite moments, $\nu(\Sigma, m) = 1/2$. As we assume fewer moments or as
we deviate away from diagonal covariance to more ill-conditioned matrices, $\nu(\Sigma, m)$ strays away from half,
but we assume it is fairly well-conditioned, being at least 1/3. We think that some such good conditioning is
necessary for our theorems to hold, but that the scalar 1/3 can be lowered.


[A4] Low signal strength. $\|\delta\|^2 = o(\min\{\lambda_{\min}(\Sigma), \ldots\})$ and $\delta^T \Sigma^k \delta = o(Tr(\Sigma^{k+1}))$ for
$k = 0, 1, 2, 3$.
Remark 6. First recall that we assumed $\Sigma$ is full rank in Assumption [A1], so $\lambda_{\min}(\Sigma) > 0$. Assumption [A4]
essentially means that the signal strength is not very large relative to the noise. For example, when $\Sigma = \sigma^2 I$
the assumption requires that $\|\delta\|^2/\sigma^2 = o(\sqrt{d})$. Indeed, it more generally implies that $\|\delta\|^2 = o(Tr(\Sigma))$.¹
We need this assumption for technical reasons, and we conjecture that our results hold under a weaker
assumption. Even in its present form, this is not such a strong assumption since (as we shall see in the
theorem statements) if the signal strength is large then the decision problem becomes too easy and such a
regime is rather uninteresting. Further note that $\delta^T\delta = o(Tr(\Sigma))$ implies, by Cauchy-Schwarz,

$\delta^T \Sigma \delta \le \lambda_{\max}(\Sigma)\|\delta\|^2 = o(\lambda_{\max}(\Sigma) Tr(\Sigma)),$

$\delta^T \Sigma^2 \delta \le Tr(\Sigma^2)\|\delta\|^2 = o(Tr(\Sigma^2) Tr(\Sigma)),$

$\delta^T \Sigma^3 \delta \le Tr(\Sigma^3)\|\delta\|^2 = o(Tr(\Sigma^3) Tr(\Sigma)) \le o(Tr(\Sigma^2) Tr^2(\Sigma)).$


[A5] High-dimensional setting, n = o(d?'^ ^Tr(E^)) = o(-\fdTr‘^{Yj)) = o(df"^). 

Remark 7. Currently, Assumption [A5] is needed only for a technicality in proving our main theorem, and 
we conjecture that it can be relaxed. 


As in Chen and Qin (2010), we do not assume that $(n, d) \to \infty$ at any particular rate. Instead, we will
analyze their behavior in two regimes that have implicit control on $n, d$. For notational convenience, denote

$\sigma_{n1}^2 := 8\,\frac{Tr(\Sigma^2)}{n^2},$ (5)

$\sigma_{n2}^2 := 8\,\frac{\delta^T \Sigma \delta}{n}.$ (6)

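Recalling that $\delta := \mu_P - \mu_Q$, the first theorem summarizes the power of $U_{CQ}$.

Theorem 1. Under [A1], [A2] and [A3a], $U_{CQ}$ has asymptotic power which equals

$\phi_{CQ} = \Phi\left( -\sqrt{\frac{Tr(\Sigma^2)}{Tr(\Sigma^2) + n\,\delta^T \Sigma \delta}} \cdot z_\alpha + \frac{n\|\delta\|^2}{\sqrt{8\,Tr(\Sigma^2) + 8n\,\delta^T \Sigma \delta}} \right) + o(1),$ (7)

where $\Phi$ is the Gaussian CDF and $z_\alpha$ is the threshold representing the $(1 - \alpha)$-quantile of the standard Gaussian
distribution.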
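
To get a feel for the power expression, the Python sketch below evaluates its leading term, assuming it has the form $\Phi\left(-\sqrt{a/(a + nb)}\, z_\alpha + n\|\delta\|^2/\sqrt{8a + 8nb}\right)$ with $a = Tr(\Sigma^2)$ and $b = \delta^T\Sigma\delta$, and with $z_\alpha$ taken as the $(1 - \alpha)$-quantile. The function name and argument names are hypothetical.

```python
from statistics import NormalDist


def asymptotic_power(n, delta_sq, tr_sigma2, delta_sigma_delta, alpha=0.05):
    """Leading term of the asymptotic power (the o(1) term is ignored).

    delta_sq is ||delta||^2, tr_sigma2 is Tr(Sigma^2), and
    delta_sigma_delta is delta^T Sigma delta.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha)          # (1 - alpha) quantile
    denom = tr_sigma2 + n * delta_sigma_delta
    ratio = (tr_sigma2 / denom) ** 0.5           # null-to-alternative std ratio
    shift = n * delta_sq / (8.0 * denom) ** 0.5  # standardized signal
    return std_normal.cdf(-ratio * z + shift)
```

With $\Sigma = I$ in $d = 50$ dimensions and $n = 100$, setting $\|\delta\|^2 = 0$ returns the level $\alpha$, while increasing $\|\delta\|^2$ from 0.5 to 1 raises the value from roughly 0.73 to roughly 0.97.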


This theorem follows from the main result of Chen and Qin (2010) for $U_{CQ}$, and hence we do not
reproduce its proof here. There, the authors prove that $U_{CQ}$ is asymptotically normally distributed with variance
$\sigma_{n1}^2 + \sigma_{n2}^2$ under the alternative, and variance $\sigma_{n1}^2$ under the null (with $\Sigma_1 = \Sigma_2 = \Sigma$ and $n_1 = n_2 = n$
being used by us). This then gives rise to the above expression for the power $\phi_{CQ}$ fairly easily, except that the
authors made a small mistake by interchanging $\sigma_{n1}$ and $\sigma_{n2}$ in one crucial expression (confirmed by email
correspondence with the authors, summarized in the Appendix). Another minor difference is that we
write down the power as a single expression, while Chen and Qin (2010) prefer to write it down in the
two aforementioned special cases of low and high SNR.

Remark 8. The null distribution of $U_{CQ}$ is asymptotically Gaussian under MDA in this high-dimensional
setting. This is in stark contrast to the fixed-$d$, increasing-$n$ setting, where the null distribution is an infinite
sum of weighted chi-squared distributions, due to the properties of degenerate U-statistics (see Serfling,
2009). This seems to have first been proved by Bai and Saranadasa (1996) for $T_{BS}$ using a martingale
central limit theorem (see Hall and Heyde, 2014).


¹This holds because $Tr(\Sigma) = Tr(\Sigma^2 \Sigma^{-1}) \le Tr(\Sigma^2)\lambda_{\min}^{-1}(\Sigma)$ by the Cauchy-Schwarz inequality $Tr(A^T B) \le \|A\|_* \|B\|_{op}$,
where $*$, $op$ refer to the nuclear and operator norms respectively.
The next theorem summarizes the power of gMMD, which is also one of the main results of the paper.

Theorem 2. Assume [A1], [A2], [A3], [A4] and [A5], and let the bandwidth be chosen as $\gamma^2 = \omega(2Tr(\Sigma))$.
Then $\mathrm{gMMD}_\gamma^2$ has asymptotic power which is independent of $\gamma$, and equals the power of $U_{CQ}$. In other
words, the power is

$\phi_{gMMD} = \Phi\left( -\sqrt{\frac{Tr(\Sigma^2)}{Tr(\Sigma^2) + n\,\delta^T \Sigma \delta}} \cdot z_\alpha + \frac{n\|\delta\|^2}{\sqrt{8\,Tr(\Sigma^2) + 8n\,\delta^T \Sigma \delta}} \right) + o(1)$

for all $\gamma^2 = \omega(2Tr(\Sigma))$.

The proof of this theorem is covered in the proofs section. While one may conjecture a result like the above
due to the claims of El Karoui (2010) that the Gaussian kernel often behaves like the linear kernel in high
dimensions, their results only hold true when $n \asymp d$ (apart from other differences in assumptions). Further,
they also interpret the results rather pessimistically, by saying that these kernels do not provide an advantage
in the high-dimensional setting, but we will demonstrate in experiments that when the linear kernel does not
suffice (the distributions have the same mean but differ in their variances), then $U_{CQ}$ has trivial power but
gMMD’s power tends to one in reasonable scenarios. Of course, more samples are probably needed to detect
differences in second moments compared to differences in first moments. Hence, we choose to interpret the
above result optimistically: not only is gMMD capable of detecting any difference in distributions, but it
also detects differences in means as well as $U_{CQ}$, which is designed to test only mean differences.

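For the purpose of mathematical analysis, we now introduce a family of statistics, for which eED$_u$ is a
special case. These are defined (recalling the definition of ED$_u$) as

$$\mathrm{eED}_\gamma := \mathrm{ED}_u(e_\gamma(\cdot,\cdot)) \quad\text{where}\quad e_\gamma(a, b) := \sqrt{\gamma^2 - 2\mathrm{Tr}(\Sigma) + \|a - b\|^2}$$

where $\gamma^2 > 2\mathrm{Tr}(\Sigma)$ is a constant user-chosen bandwidth parameter. Note that

$$\lim_{\gamma^2 \to 2\mathrm{Tr}(\Sigma)^+} \mathrm{eED}_\gamma = \mathrm{eED}_u.$$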


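The next theorem summarizes the power of eED$_\gamma$, in all cases when $\gamma^2 = \omega(2\mathrm{Tr}(\Sigma))$.

Theorem 3. Assume [A1], [A2], [A3], [A4] and [A5], and let the bandwidth be chosen as $\gamma^2 = \omega(2\mathrm{Tr}(\Sigma))$.
Then eED$_\gamma$ has asymptotic power which is independent of $\gamma$, and equals the power of $U_{CQ}$. In other words,
the power is

$$\phi_{eED} = \Phi\left( -\frac{\sqrt{\mathrm{Tr}(\Sigma^2)}}{\sqrt{\mathrm{Tr}(\Sigma^2) + n\,\delta^T\Sigma\delta}}\, z_\alpha + \frac{\|\delta\|^2}{\sqrt{8\mathrm{Tr}(\Sigma^2)/n^2 + 8\,\delta^T\Sigma\delta/n}} \right) + o(1)$$

for all $\gamma^2 = \omega(2\mathrm{Tr}(\Sigma))$.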


The proof of this theorem is similar to the proof of Theorem 2 and hence is briefly covered at the end of
Section 8 after the proof of Theorem 2.

Remark 9. We remark on our inability to prove the above theorems for the limiting case of $\gamma^2 \asymp 2\mathrm{Tr}(\Sigma)$.
The proofs of Theorems 2 and 3 are based on Taylor expansions of the corresponding kernels (recall
their definitions). This leads to a “dominant” Taylor term $U_2/\gamma^2$, which is a U-statistic in $h_2$,
and a “remainder” term $U_4/\gamma^4$, which is a U-statistic in $h_4$, where

$$h_2(X, X', Y, Y') = \|X - X'\|^2 + \|Y - Y'\|^2 - \|X - Y'\|^2 - \|X' - Y\|^2, \quad (8)$$

$$h_4(X, X', Y, Y') = \|X - X'\|^4 + \|Y - Y'\|^4 - \|X - Y'\|^4 - \|X' - Y\|^4. \quad (9)$$



























One can easily observe that $h_2 = -2h_{CQ}$, and hence the behavior of $U_2$ is immediately captured
by the behavior of $U_{CQ}$, the most important fact being that $U_2$ is always Gaussian under the null and the
alternative (as mentioned in the earlier remarks on $U_{CQ}$). When $\gamma^2 = \omega(\mathrm{Tr}(\Sigma))$, we prove
that $U_4/\gamma^4 = o_P(U_2/\gamma^2)$. However, when $\gamma^2 \asymp 2\mathrm{Tr}(\Sigma)$, our results suggest that $U_4/\gamma^4 = O_P(U_2/\gamma^2)$.

However, while we know that $U_2/\gamma^2$ is asymptotically Gaussian, we do not know the limiting distribution of
$U_4/\gamma^4$, even though we undertake tedious calculations to find the mean and variance of $U_4$. Hence, while
this allows us to make arguments about the mean and variance of gMMD and eED, we cannot make power
claims, since for that purpose we require knowing the limiting distribution of $U_4$ under the null. While we
conjecture that it is indeed Gaussian and simulations support this, the proof is vastly more complicated than
for $U_2$ because the number of terms to be controlled in the martingale central limit theorem is larger (by
an order of magnitude, as the number of terms grows exponentially). Proving the above theorem statements
for the limiting case is an important direction for future work, and may require development of the theory of
U-statistics for high dimensional variables. However, for the moment we show a variety of experiments that
support our conjecture, implying that the borderline case is probably a technical limitation.


4.1 The Special Case of $\Sigma = \sigma^2 I$


Though no explicit assumptions are placed on $n, d$ for the above expression (and hence for consistency to
hold), for further understanding of the power of these tests, let us consider the situation when $\Sigma = \sigma^2 I$ and
define the signal-to-noise ratio (SNR) as

$$\mathrm{SNR}\ \Psi := \frac{\|\delta\|}{\sigma}.$$

One can think of $\Psi$ as the problem-dependent constant, which determines how hard the testing problem is:
of course, the larger the SNR, the easier the distributions are to distinguish. Indeed, in the special case of $P, Q$
being spherical Gaussians, $\Psi^2/2$ is just the KL-divergence between these distributions. Then, the expression
for power simplifies to

$$\Phi\left( -\frac{\sqrt{d}}{\sqrt{d + n\Psi^2}}\, z_\alpha + \frac{\Psi^2}{\sqrt{8d/n^2 + 8\Psi^2/n}} \right) + o(1). \quad (10)$$


We are most interested in the regimes where $\Psi$ is small. Let us define the three regimes as follows:

Low SNR: $\Psi = o(\sqrt{d/n})$, (11)

Medium SNR: $\Psi \asymp \sqrt{d/n}$, (12)

High SNR: $\Psi = \omega(\sqrt{d/n})$. (13)


Remark 10. We find it worthy to note that the behavior is different in the low and high SNR regimes.
Specifically, in the Low SNR regime, the asymptotic power is

$$\phi_L = \Phi\left( -z_\alpha + \frac{n\Psi^2}{\sqrt{8d}} \right) \quad \text{when } \Psi = o(\sqrt{d/n}), \quad (14)$$

while in the high SNR regime, the asymptotic power is

$$\phi_H = \Phi\left( \frac{\sqrt{n}\,\Psi}{\sqrt{8}} \right) \quad \text{when } \Psi = \omega(\sqrt{d/n}). \quad (15)$$

The above two rates match in the Medium SNR regime, yielding a power $\asymp \Phi(\sqrt{d})$.
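The regime behavior described in Remark 10 can be explored numerically. The following is a minimal sketch, assuming the leading term of Eq. (10) with the $o(1)$ term dropped; the function names are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf gives quantiles


def power_eq10(n, d, psi, alpha=0.05):
    """Leading term of Eq. (10): asymptotic power when Sigma = sigma^2 I."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)  # upper-alpha normal quantile
    null_scale = sqrt(d) / sqrt(d + n * psi ** 2)
    signal = psi ** 2 / sqrt(8.0 * d / n ** 2 + 8.0 * psi ** 2 / n)
    return STD_NORMAL.cdf(-null_scale * z_alpha + signal)


def power_low_snr(n, d, psi, alpha=0.05):
    """Eq. (14): simplification of Eq. (10) when psi = o(sqrt(d / n))."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(-z_alpha + n * psi ** 2 / sqrt(8.0 * d))


def power_high_snr(n, psi):
    """Eq. (15): simplification of Eq. (10) when psi = omega(sqrt(d / n))."""
    return STD_NORMAL.cdf(sqrt(n) * psi / sqrt(8.0))
```

With `psi = 0` the formula returns the level `alpha`, and for `n * psi**2` much smaller than `d` the values of `power_eq10` and `power_low_snr` agree closely, which is one way to sanity-check the regime boundaries.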

There is a mistake/typo in the paper by Chen and Qin [2010], which causes them to miss this surprising observation. We have
confirmed this important typo with the authors, and describe the context of its occurrence in more detail in Appendix Sec. A.
5 Lower Bounds when $\Sigma = \sigma^2 I$


Here we show that the form of the power achieved in Theorem 2 is not improvable under certain assumptions.
For example, in the case when $\Sigma = \sigma^2 I$, we can provide matching lower bounds to Eq. (10) using techniques
from Ingster and Suslina [2003] designed for the Gaussian normal means problem. The proof relies on the
Gaussian approximations of the central and noncentral chi-squared distributions.


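Proposition 1. Let $G_d(x, 0)$ be the cdf of a central chi-squared distribution with $d$ degrees of freedom and
$G_d(x, r^2)$ be the cdf of a noncentral chi-squared distribution with $d$ degrees of freedom and noncentrality
parameter $r^2$. Then as $d \to \infty$, we have uniformly over $x, r$

$$G_d(x, 0) = \Phi\left( \frac{x - d}{\sqrt{2d}} \right) + o(1), \quad (16)$$

$$G_d(x, r^2) = \Phi\left( \frac{x - d - r^2}{\sqrt{2d + 4r^2}} \right) + o(1), \quad (17)$$

$$G_d(T_{d\alpha}, r^2) = \Phi\left( \frac{\sqrt{2d}}{\sqrt{2d + 4r^2}}\, z_\alpha - \frac{r^2}{\sqrt{2d + 4r^2}} \right) + o(1), \quad (18)$$

where $T_{d\alpha}$ is the $1 - \alpha$ quantile cutoff of the $\chi^2_d$ distribution and $z_\alpha$ is the corresponding quantile of the standard normal.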

Remark 11. Our Eq. (18) differs from Ingster and Suslina [2003] [Ch 1.3, Pg 13, Eq. 1.14] where the authors
applied the additional approximation that $d \to \infty$ with $r$ fixed (or just $d \gg r$) to get

$$G_d(T_{d\alpha}, r^2) = \Phi\left( z_\alpha - r^2/\sqrt{2d} \right) + o(1). \quad (19)$$

We do not make this approximation.

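Proof of Proposition 1. The first two expressions appear verbatim in Ingster and Suslina [2003] [Ch 1.3, Pg
12]. Substituting $x = T_{d\alpha}$ into the second expression yields

$$G_d(T_{d\alpha}, r^2) = \Phi\left( \frac{T_{d\alpha} - d}{\sqrt{2d + 4r^2}} - \frac{r^2}{\sqrt{2d + 4r^2}} \right) + o(1).$$

The last expression then follows due to the following fact:

$$\frac{T_{d\alpha} - d}{\sqrt{2d}} \to z_\alpha. \quad (20)$$

Eq.(20) holds by the following argument. First note that

$$(\chi^2_d - d)/\sqrt{2d} \rightsquigarrow N(0, 1).$$

Then by definition of $T_{d\alpha}$,

$$P(\chi^2_d > T_{d\alpha}) \le \alpha$$

which then implies

$$P\left( Z > \frac{T_{d\alpha} - d}{\sqrt{2d}} \right) \le \alpha + o(1)$$

for standard normal $Z$. Since we know that $P(Z > z_\alpha) \le \alpha$, Eq.(20) follows.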
Next, define $S_d(\rho) = \{\delta \in \mathbb{R}^d : \|\delta\| = \rho\}$ to be the surface of the $d$-dimensional sphere of radius $\rho$. For
the normal means problem, we are given $Z \sim N(\delta, I_d)$ and we test $H_0 : \delta = 0$ against $H_1 : \delta \in S_d(\rho)$.
Recalling the definition of $[\eta]_{n,d,\alpha}$, we analogously define $[\nu]_{d,\alpha}$ for the normal means problem
as the set of all tests $\nu : \mathbb{R}^d \to [0, 1]$ with expected type-1 error at most $\alpha$. Define the minimax power at
level $\alpha$ as

$$P(\rho, \alpha) := \sup_{\nu \in [\nu]_{d,\alpha}} \ \inf_{\delta \in S_d(\rho)} E_\delta\, \nu.$$

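Proposition 2. Given $Z \sim N(\delta, I_d)$ where $\|\delta\| = \rho$, the minimax power for the normal means problem is

$$P(\rho, \alpha) = 1 - G_d(T_{d\alpha}, \rho^2) = \Phi\left( -\frac{\sqrt{2d}}{\sqrt{2d + 4\rho^2}}\, z_\alpha + \frac{\rho^2}{\sqrt{2d + 4\rho^2}} \right) + o(1).$$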


Proof. This proposition is almost verbatim from Proposition 2.15 on pg 69 of Ingster and Suslina [2003]. Its
proof is given in Example 2.2 on pg 51 of Ingster and Suslina [2003], the end of the example yielding the
expression for power as $G_d(T_{d\alpha}, \rho^2)$. The only difference in our proposition statement is that we directly use
the expression for $G_d(T_{d\alpha}, \rho^2)$ in Eq.(18) instead of the approximation in Eq.(19).


The above proposition now directly yields a lower bound for two sample testing when $\Sigma = \sigma^2 I$. Let

$$T_d(\rho, \sigma) := \{(P, Q) : E_P[X] - E_Q[Y] \in S_d(\rho),\ E[XX^T] - E[X]E[X]^T = E[YY^T] - E[Y]E[Y]^T = \sigma^2 I\}$$

represent the set of all pairs of $d$-dimensional distributions $P, Q$ whose means differ by $\delta \in S_d(\rho)$ and whose
covariances are both $\sigma^2 I$. Define the minimax power at level $\alpha$ as

$$\beta(\rho, \sigma, \alpha) := \sup_{\eta \in [\eta]_{n,d,\alpha}} \ \inf_{(P,Q) \in T_d(\rho,\sigma)} E_{P,Q}\, \eta.$$


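Theorem 4. Given $X_1, \ldots, X_n \sim N(0, \sigma^2 I_d)$ and $Y_1, \ldots, Y_n \sim N(\delta, \sigma^2 I_d)$, suppose we want to test $\delta = 0$
against $\delta \in S_d(\rho)$. Then putting $\Psi := \rho/\sigma$, the minimax power is

$$\beta(\rho, \sigma, \alpha) = \Phi\left( -\frac{\sqrt{d}}{\sqrt{d + n\Psi^2}}\, z_\alpha + \frac{\Psi^2}{\sqrt{8d/n^2 + 8\Psi^2/n}} \right) + o(1).$$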


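Proof. Denote

$$Z = \sum_i \frac{X_i - Y_i}{\sqrt{2\sigma^2 n}} = \sqrt{n/2}\, \frac{(\bar{X} - \bar{Y})}{\sigma}.$$

Under the null,

$$Z \sim N(0, I_d)$$

and under the alternate

$$Z \sim N(\delta, I_d)$$

for $\delta \in S_d(\rho')$, where $\rho' = \sqrt{n/2}\, \rho/\sigma$, i.e. $\rho'^2 = n\Psi^2/2$. Our claim follows by direct substitution into
Proposition 2.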


Remark 12. This lower bound expression exactly matches the upper bound expression in Eq. (10), including
matching constants, showing that all of the discussed tests are minimax optimal in this setting of $\Sigma = \sigma^2 I$.
Even though the current lower bounds can possibly be strengthened to include nondiagonal $\Sigma$, we remark
that we have not been able to find even these diagonal-covariance lower bounds in the two sample testing
literature, especially ones that are accurate even to constants.
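The substitution used in the proof of Theorem 4 can be checked numerically. The sketch below, with hypothetical function names, evaluates the leading term of Proposition 2 at $\rho'^2 = n\Psi^2/2$ and compares it with the leading term of Eq. (10); both drop the $o(1)$ terms.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_means_minimax_power(d, rho_sq, alpha=0.05):
    """Leading term of Proposition 2 for squared radius rho_sq."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    scale = sqrt(2.0 * d + 4.0 * rho_sq)
    return STD_NORMAL.cdf(-sqrt(2.0 * d) / scale * z_alpha + rho_sq / scale)


def two_sample_lower_bound(n, d, psi, alpha=0.05):
    """Theorem 4: plug rho'^2 = n * psi^2 / 2 into Proposition 2."""
    return normal_means_minimax_power(d, n * psi ** 2 / 2.0, alpha)


def upper_bound_eq10(n, d, psi, alpha=0.05):
    """Leading term of Eq. (10), the power achieved by the tests."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    null_scale = sqrt(d) / sqrt(d + n * psi ** 2)
    signal = psi ** 2 / sqrt(8.0 * d / n ** 2 + 8.0 * psi ** 2 / n)
    return STD_NORMAL.cdf(-null_scale * z_alpha + signal)


# Largest discrepancy over a small grid of (n, d, psi) values.
max_gap = max(
    abs(two_sample_lower_bound(n, d, psi) - upper_bound_eq10(n, d, psi))
    for n in (50, 500)
    for d in (20, 2000)
    for psi in (0.1, 0.5, 2.0)
)
```

The two expressions are algebraically identical, so `max_gap` should only reflect floating-point rounding.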
6 Computation-Statistics Tradeoffs


In this section we will consider computationally cheaper alternatives to computing the quadratic time gMMD$^2$
that were suggested in Gretton et al. [2012a] and Zaremba et al. [2013], namely a block-based gMMD$^2_B$ and
a linear-time gMMD$^2_L$. While the quadratic-time statistic is statistically the most efficient of these (it is a
Rao-Blackwellized U-statistic), it is not clear how much worse the other options are: if they are only slightly
worse, the computational benefits could be worth it if there is a large amount of data. Due to the lack of a
high-dimensional analysis in Gretton et al. [2012a], it was inferred that one suffers for cheaper computation
with power that is worse by a constant factor compared to the power of gMMD$^2$. We will show that, for
MDA, the power is worse not by constants but by exponents of $n$ (presumably this would only get worse
for GDA). At all points, the Assumptions in Section 3 are assumed to hold wherever needed, so that we can
proceed directly to comparisons.

Assume that we divide the data into $B = B(n)$ blocks of size $n/B$ with $n/B \to \infty$. Let gMMD$^2(b)$ be
the gMMD$^2$ statistic evaluated only on the samples in block $b \in \{1, \ldots, B\}$, and let the block-based MMD
be defined as

$$\mathrm{gMMD}^2_B = \frac{1}{B} \sum_{b=1}^{B} \mathrm{gMMD}^2(b).$$

We note that this statistic takes $(n/B)^2 B = n^2/B$ time to compute.

The case $B = n/2$, i.e. using blocks of size just 2, violates $n/B \to \infty$, so we look at
this case separately. This statistic takes just linear time to compute, since each block $b$ is of size 2, and
we define the linear time MMD as

$$\mathrm{gMMD}^2_L = \frac{2}{n} \sum_{b=1}^{n/2} \mathrm{gMMD}^2(b). \quad (21)$$
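The block-based and linear-time statistics defined above can be sketched in a few lines. This is a minimal pure-Python illustration, assuming two samples of equal size, the Gaussian kernel convention $\exp(-\|a-b\|^2/\gamma^2)$ used in Section 8, and the U-statistic over ordered pairs $i \ne j$; all function names are hypothetical.

```python
import math


def gaussian_kernel(a, b, gamma_sq):
    """Gaussian kernel exp(-||a - b||^2 / gamma^2) (bandwidth convention assumed)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / gamma_sq)


def gmmd_sq(xs, ys, gamma_sq):
    """Quadratic-time U-statistic over ordered pairs i != j."""
    n = len(xs)
    if n < 2 or len(ys) != n:
        raise ValueError("need two samples of equal size n >= 2")
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (gaussian_kernel(xs[i], xs[j], gamma_sq)
                          + gaussian_kernel(ys[i], ys[j], gamma_sq)
                          - gaussian_kernel(xs[i], ys[j], gamma_sq)
                          - gaussian_kernel(xs[j], ys[i], gamma_sq))
    return total / (n * (n - 1))


def block_gmmd_sq(xs, ys, gamma_sq, num_blocks):
    """Average of the per-block statistics; costs about n^2 / num_blocks kernel calls."""
    n = len(xs)
    if n % num_blocks != 0:
        raise ValueError("num_blocks must divide the sample size")
    size = n // num_blocks
    block_values = [
        gmmd_sq(xs[b * size:(b + 1) * size], ys[b * size:(b + 1) * size], gamma_sq)
        for b in range(num_blocks)
    ]
    return sum(block_values) / num_blocks


def linear_time_gmmd_sq(xs, ys, gamma_sq):
    """Eq. (21): blocks of size 2, i.e. num_blocks = n / 2."""
    return block_gmmd_sq(xs, ys, gamma_sq, len(xs) // 2)
```

Setting `num_blocks=1` recovers the quadratic-time statistic, while larger values trade statistical efficiency for computation as discussed in the rest of this section.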

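Theorem 5. Under assumptions [A1], [A2], [A3], [A4], [A5] (appropriately holding for $n/B$ points), and with
the bandwidth chosen as $\gamma^2 = \omega(\mathrm{Tr}(\Sigma))$, the power of gMMD$^2_B$ is

$$\phi^B = \Phi\left( -\frac{\sqrt{B\,\mathrm{Tr}(\Sigma^2)}}{\sqrt{B\,\mathrm{Tr}(\Sigma^2) + n\,\delta^T\Sigma\delta}}\, z_\alpha + \frac{\sqrt{B}\,\|\delta\|^2}{\sqrt{8B^2\mathrm{Tr}(\Sigma^2)/n^2 + 8B\,\delta^T\Sigma\delta/n}} \right) + o(1).$$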


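Proof. Let $\sigma^2_{B1}$ and $\sigma^2_{B2}$ be as defined earlier, but each calculated on $n/B$ points instead of $n$ points,
and scaled by $\gamma^4$, i.e.

$$\sigma^2_{B1} = \frac{8B^2\,\mathrm{Tr}(\Sigma^2)}{\gamma^4 n^2}, \qquad \sigma^2_{B2} = \frac{8B\,\delta^T\Sigma\delta}{\gamma^4 n}.$$

Define $\sigma^2_B := \sigma^2_{B1} + \sigma^2_{B2}$, and let gMMD$^2$ (without a subscript) denote the population quantity. Then from our earlier arguments we have that

Under $H_0$, gMMD$^2(b) \rightsquigarrow N(0, \sigma^2_{B1})$, (22)

Under $H_1$, gMMD$^2(b) \rightsquigarrow N(\mathrm{gMMD}^2, \sigma^2_{B1} + \sigma^2_{B2})$. (23)

Hence, the distribution of gMMD$^2_B$ is $N(0, \sigma^2_{B1}/B)$ under the null and $N(\mathrm{gMMD}^2, \sigma^2_B/B)$ under the alternative.
Hence, from our earlier results it is straightforward to note that under $H_0$,

$$\frac{\sqrt{B}\,\mathrm{gMMD}^2_B}{\sigma_{B1}} \rightsquigarrow N(0, 1)$$

and under $H_1$,

$$\frac{\sqrt{B}\,(\mathrm{gMMD}^2_B - \mathrm{gMMD}^2)}{\sigma_B} \rightsquigarrow N(0, 1).$$

Hence our test statistic will be

$$T_B := \sqrt{B}\, \frac{\mathrm{gMMD}^2_B}{\sigma_{B1}}$$

with our test being given by $I(T_B > z_\alpha)$, where $z_\alpha$ is the upper $\alpha$ quantile cutoff of the standard normal distribution.
Note that in practice, we would simply use a studentized statistic by plugging in the estimated $\sigma_{B1}$. Then, the
power of this test is

$$P_{H_1}(T_B > z_\alpha) = P\left( \frac{\sqrt{B}\,(\mathrm{gMMD}^2_B - \mathrm{gMMD}^2)}{\sigma_B} > \frac{\sigma_{B1}}{\sigma_B}\, z_\alpha - \frac{\sqrt{B}\,\mathrm{gMMD}^2}{\sigma_B} \right)$$

$$= 1 - \Phi\left( \frac{\sigma_{B1}}{\sigma_B}\, z_\alpha - \frac{\sqrt{B}\,\mathrm{gMMD}^2}{\sigma_B} \right) \quad (25)$$

$$= \Phi\left( -\frac{\sigma_{B1}}{\sigma_B}\, z_\alpha + \frac{\sqrt{B}\,\|\delta\|^2}{\sqrt{8B^2\mathrm{Tr}(\Sigma^2)/n^2 + 8B\,\delta^T\Sigma\delta/n}} \right) + o(1). \quad (26)$$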

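It is again useful to consider the case of $\Sigma = \sigma^2 I$ for some insight, and recall $\Psi = \|\delta\|/\sigma$. Specifically,
in the low SNR regime the power is

$$\phi^B_L = \Phi\left( -z_\alpha + \frac{n\Psi^2}{\sqrt{8Bd}} \right) \quad \text{when } \Psi = o(\sqrt{Bd/n}), \quad (27)$$

while in the very high SNR regime, the power behaves like

$$\phi^B_H = \Phi\left( \frac{\sqrt{n}\,\Psi}{\sqrt{8}} \right) \quad \text{when } \Psi = \omega(\sqrt{Bd/n}). \quad (28)$$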


Of course, the above two rates match in the Medium SNR regime. Here we use the italicized very because
it is a $\sqrt{B}$ times larger SNR requirement than the high SNR regime given in Eq.(13) of $\Psi = \omega(\sqrt{d/n})$.
Comparing Eqs. (13), (15) to the ones above, in the very high SNR regime, i.e. $\Psi = \omega(\sqrt{Bd/n})$, we have

$$\phi^B_H = \phi_H.$$

However, the low SNR regime is statistically more interesting. In this case, the power of the block test is $\sqrt{B}$
times worse (inside the $\Phi$ transformation). Noting that the block based test takes time $n^2/B$ to compute, we
find the factor $n/\sqrt{B}$ in Eq.(27) quite illuminating (it is the square-root of the time taken).

It was proved in |Reddi et al.|pOT3| that the power of the linear-time statistic is given by 

$$\Phi\left( -z_\alpha + \frac{\sqrt{n}\,\Psi^2}{\sqrt{8d + 8\Psi^2}} \right)$$

and hence its power in the low SNR regime is given by $\Phi\left( -z_\alpha + \sqrt{n}\,\Psi^2/\sqrt{8d} \right)$. In the (very very) high SNR regime of
$\Psi = \omega(\sqrt{d})$, its power does not suffer, and is exactly $\Phi(\sqrt{n}\,\Psi/\sqrt{8})$ like all the above statistics, but in the low
SNR regime its dependence on $n$ suffers (and again it is the square-root of the computation time taken).


Remark 13. We can summarize this section informally as follows. If the test statistic takes time $n^t$ to compute
for $1 \le t \le 2$, then the power behaves like $\Phi\left( -z_\alpha + n^{t/2}\Psi^2/\sqrt{8d} \right)$ in the low SNR regime.
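The tradeoff summarized in Remark 13 can be tabulated numerically. The sketch below assumes the $\Sigma = \sigma^2 I$ specialization of the block-test power (leading term only, $o(1)$ dropped) and the low SNR form with computation time $n^t$; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def block_power(n, d, psi, num_blocks, alpha=0.05):
    """Leading-term power of the block test when Sigma = sigma^2 I (assumed form)."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    b = float(num_blocks)
    null_scale = sqrt(b * d) / sqrt(b * d + n * psi ** 2)
    signal = sqrt(b) * psi ** 2 / sqrt(8.0 * b ** 2 * d / n ** 2 + 8.0 * b * psi ** 2 / n)
    return STD_NORMAL.cdf(-null_scale * z_alpha + signal)


def low_snr_power_for_time(n, d, psi, t, alpha=0.05):
    """Low SNR power when the statistic costs n^t time, for 1 <= t <= 2."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(-z_alpha + n ** (t / 2.0) * psi ** 2 / sqrt(8.0 * d))
```

For example, with `n = d = 1000` and `psi = 0.3`, `block_power` decreases as `num_blocks` grows from 1 to 10 to 100, which reflects the loss in power paid for cheaper computation.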
7 Experiments

In our experience, our claimed theorems hold true much more generally in practice. For example:

1. While we need $n, d$ to be polynomially related in theory, our experiments show that $\phi_{CQ} =
\phi_{eED} = \phi_{gMMD}$ even when $n$ is fixed and $d$ increases, or when $d$ is fixed and $n$ increases.

2. While our theory seems to suggest that $\gamma^2 = \omega(\mathrm{Tr}(\Sigma))$ is needed, the experiments suggest that $\gamma^2 =
\Omega(\mathrm{Tr}(\Sigma))$ suffices.

Before we describe our experimental suite, let us first detour to mention the “median heuristic”. 


7.1 The Median Heuristic 


The median heuristic chooses the bandwidth for the Gaussian kernel as the median pairwise distance between
all pairs of points (see Schölkopf and Smola [2002]). In other words, it chooses

$$\gamma^2 = \text{Empirical Median}\, \{\|S - S'\|^2\}$$

where $S \ne S' \in \{X_1, \ldots, X_n, Y_1, \ldots, Y_n\}$. To have some idea of the order of magnitude of the choice
that the median heuristic makes, let us make the reasonable supposition that this choice is similar to the
mean-heuristic, which chooses it to be the average squared distance between all pairs of points, i.e. let us assume for
argument's sake that

$$\text{Empirical Median}\, \{\|S - S'\|^2\} \asymp \text{Population Mean}\, \{\|S - S'\|^2\}.$$

Then the following proposition captures the order of magnitude of the bandwidth choice made by the common
median heuristic.

Proposition 3. Under [A1], the average squared distance between all pairs of points is $\asymp 2\mathrm{Tr}(\Sigma)$. Hence, under
[A1], the median-heuristic chooses $\gamma^2 \asymp 2\mathrm{Tr}(\Sigma)$.

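Proof. There are $\binom{n}{2}$ pairs of $x$s, $\binom{n}{2}$ pairs of $y$s and $n^2$ $xy$ pairs, the total number of pairs being $\binom{2n}{2}$. This
implies that the population mean pairwise squared distance is

$$\frac{\binom{n}{2}}{\binom{2n}{2}} E\|X - X'\|^2 + \frac{\binom{n}{2}}{\binom{2n}{2}} E\|Y - Y'\|^2 + \frac{n^2}{\binom{2n}{2}} E\|X - Y\|^2.$$

We have

$$E\|X - X'\|^2 = E\|(X - \mu_1) - (X' - \mu_1)\|^2 = 2E(X - \mu_1)^T(X - \mu_1) = 2E\,\mathrm{Tr}((X - \mu_1)(X - \mu_1)^T) = 2\mathrm{Tr}(\Sigma),$$

$$E\|X - Y\|^2 = E\|X\|^2 + E\|Y\|^2 - 2EX^TY = E\|X - \mu_1\|^2 + \|\mu_1\|^2 + E\|Y - \mu_2\|^2 + \|\mu_2\|^2 - 2\mu_1^T\mu_2 = 2\mathrm{Tr}(\Sigma) + \|\delta\|^2.$$

Together, these imply our claim.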



Remark 14. The above proposition implies that the choice made by the median heuristic is at the borderline
of satisfying the condition under which our main theorem holds, which is $\gamma^2 = \omega(\mathrm{Tr}(\Sigma))$. Practically, in our
experiments that follow, all the claims still seem to hold even when $\gamma^2 \asymp \mathrm{Tr}(\Sigma)$. This suggests
that the conditions currently needed for our theory are possibly stronger than necessary. Hence, this “heuristic”
actually provides a reasonable default bandwidth choice since $\Sigma$ is usually unknown.
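The order of magnitude claimed in Proposition 3 can be illustrated on simulated data. The following is a small sketch, assuming standard normal samples with $\Sigma = I_d$ (so $2\mathrm{Tr}(\Sigma) = 2d$) and no mean shift; the function names, seed and sample sizes are arbitrary illustrative choices.

```python
import random
import statistics


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def median_heuristic_bandwidth_sq(points):
    """gamma^2 = empirical median of squared distances over all distinct pairs."""
    dists = [
        squared_distance(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    return statistics.median(dists)


rng = random.Random(0)  # fixed seed so the illustration is reproducible
d, n = 50, 40
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
ys = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

gamma_sq = median_heuristic_bandwidth_sq(xs + ys)
ratio = gamma_sq / (2.0 * d)  # compare with 2 Tr(Sigma) = 2d
```

In this setting `ratio` should be close to 1, consistent with the median heuristic choosing $\gamma^2 \asymp 2\mathrm{Tr}(\Sigma)$.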
7.2 Practical accuracy of our theory


Here, we consider a wide variety of experiments and demonstrate that our claims hold true with great accuracy
in practice, and actually in greater generality than we can currently prove.

The different test statistics considered in this simulation suite (as given in the legends) are:

1. uMMD0.5 - gMMD with $\gamma \asymp d^{0.5}$, i.e. $\gamma^2 \asymp \mathrm{Tr}(\Sigma)$.

2. uMMD Median - gMMD with $\gamma$ chosen by the aforementioned median heuristic.

3. uMMD0.75 - gMMD with $\gamma \asymp d^{0.75}$, i.e. $\gamma^2 = \omega(\mathrm{Tr}(\Sigma))$.

4. ED - (Euclidean) energy distance eED, i.e. eED$_\gamma$ with $\gamma^2 = 2\mathrm{Tr}(\Sigma)$.

5. uCQ - The U-statistic $U_{CQ}$ from Chen and Qin [2010].

6. lMMD$\xi$ - The linear-time gMMD$^2_L$ statistic from Eq.(21) with $\xi \in \{0.5, 0.75, \text{Median}\}$ specifying
the bandwidth as in the case of gMMD above.

7. lCQ - The linear-time version of $U_{CQ}$.

We plot the power of all these test statistics when $\alpha = 0.05$, for various $P, Q$, by running 100 repetitions
of the two sample test for each parameter setting. As a one sentence summary of all the experiments that
follow, we find that all the U-statistics have exactly the same power under mean-differences, as claimed
by our theorems, i.e. $\phi_{CQ} = \phi_{gMMD} = \phi_{eED}$ for all the above choices of bandwidth, while the
linear-time statistics perform significantly worse, also as predicted by the theory (demonstrating the
computation-statistics tradeoff).


Experiment 1. For this experiment we use the following distributions. We vary $d$ from 40 to 200 and
always draw $n = d$ samples from the corresponding $P, Q$.

• Normal distribution with diagonal covariance: $P = N(\mu_0, I_{d \times d})$ and $Q = N(\mu_1, I_{d \times d})$, where $\mu_0 =
(0 \ldots 0)^T$ and $\mu_1$ is a scalar multiple of $(1 \ldots 1)^T$.

• Product of Laplace distributions: $P$ and $Q$ are shifted Laplace distributions with shifts $\mu_0 = (0 \ldots 0)^T$
and $\mu_1$ (a scalar multiple of $(1 \ldots 1)^T$) respectively, and identity covariance matrix.

• Product of Beta distributions: $P$ and $Q$ are shifted Beta distributions Beta(1, 1) with shifts $\mu_0 =
(0 \ldots 0)^T$ and $\mu_1$ (a scalar multiple of $(1 \ldots 1)^T$) respectively, and identity covariance matrix.

• Mixture of Gaussian distributions: $P$ and $Q$ are shifted mixtures of the Gaussians $N(0, I_{d \times d})$, $N(0, 2I_{d \times d})$
and $N(0, 3I_{d \times d})$, with shifts $\mu_0 = (0 \ldots 0)^T$ and $\mu_1$ (a scalar multiple of $(1 \ldots 1)^T$) respectively.

The values of the shifts and covariance matrices are chosen to keep the asymptotic power the same for all the
distributions (see Theorem 2). Figure 1 shows the performance of various estimators for the aforementioned
two sample test settings. It is clear that the powers of eED, $U_{CQ}$ and gMMD all coincide for any (sufficiently
large) bandwidth, increasing as $\Phi(\sqrt{d})$ for the quadratic time statistics, and staying constant for the linear
time statistics, both as predicted by the theory. Also note that the fact that the plots look almost identical is
consistent with our theory (see Theorem 2).

Experiment 2. In the previous experiment, we have seen the performance of the estimators for a diagonal
covariance matrix. Here, we empirically verify that similar effects can be observed in distributions with
non-diagonal covariance matrices. To this end, we consider distributions $P = N(\mu_0, \Sigma')$ and $Q = N(\mu_1, \Sigma')$,
where $\mu_0 = (0 \ldots 0)^T$, $\mu_1$ is a scalar multiple of $(1 \ldots 1)^T$ and $\Sigma' = U\Lambda'U^T$. The matrix $U$ is a random unitary matrix
obtained from the eigenvectors of a random Gaussian matrix. $\Lambda'$ is set as follows. Let $\Lambda$ be a diagonal matrix,
the entries of which are equally spaced between 0.01 and 1, raised to the power 6. This experimental setup
is similar to one used in Lopes et al. [2011]. The matrix $\Lambda'$ is a normalization of $\Lambda$ by its trace $\mathrm{tr}(\Lambda)$.
Figure 2 shows that the qualitative performance of all statistics is similar to the one observed in the
previous experiment (see Figure 1).
Figure 1: Power vs Dimension when P, Q are mean-shifted Normal (top left), Laplace (top right), Beta
(bottom left) or Mixture (bottom right plot) distributions.

Figure 2: Power vs d when P, Q are mean-shifted Normal (top left) with non-diagonal covariance matrix.
Experiment 4. The aim of this experiment is to study the performance of the statistics when distributions
differ in covariances rather than means. In this experiment, we set $P = N(0, \Sigma_1)$ and $Q = N(0, \Sigma_2)$, where
$\Sigma_1$ and $\Sigma_2$ differ. Here, $\Sigma$ is a positive definite matrix $U\Lambda U^T$ where $U$ and $\Lambda$ are generated
as described in Experiment 2. Again, the experimental setup is similar to the one used in Lopes et al. [2011].
Not surprisingly, as seen in Figure 3, gMMD and eED perform better than CQ.


Figure 3: Power vs d when P, Q are distributions differing in Covariances.


This experiment demonstrates that gMMD and eED dominate $U_{CQ}$ in some sense. This is due to the fact
that CQ is designed for mean-shift alternatives while the rest of them work for more general alternatives. Hence,
they achieve the same power when the distributions differ in their means, and strictly higher power when the
distributions do not differ in their means, but only in some higher moment. We can also see that the powers
of the different statistics are no longer equal, and that the bandwidth does matter in this situation.


Experiment 5. Finally, we verify the nature of the asymptotic power for fixed dimension. For the purpose
of this experiment, we hold $d$ fixed at the value 40 and vary $n$. Here, we consider two sample tests for normal
distributions with diagonal and non-diagonal covariance matrices (used in Experiment 1 and Experiment 2
respectively). Figure 4 illustrates the power of the tests under this scenario. It can be seen that power increases
with $n$ in a manner similar to the ones observed in the previous experiments.



Figure 4: Power vs Sample size for fixed dimension when P, Q are normal distributions with diagonal (left
plot) and non-diagonal (right plot) covariance matrices respectively. Legend entries: uMMD0.5, lMMD0.5,
uMMD0.75, lMMD0.75, uMMD Median, lMMD Median, uCQ, lCQ, ED.
This experiment suggests that assumption [A5] can probably be relaxed or dropped from the theory. We
need it only to bound a certain Taylor remainder term in the proof of the theorems that follow, and it is
perhaps possible to find a better way to bound this term.

8 Proofs of Theorems 2 and 3

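Let us first note that the gMMD statistic can be written as

$$\mathrm{gMMD} = \begin{bmatrix} 1_n/\sqrt{n(n-1)} \\ -1_n/\sqrt{n(n-1)} \end{bmatrix}^T \begin{bmatrix} K_{XX} & K_{XY} \\ K_{XY}^T & K_{YY} \end{bmatrix} \begin{bmatrix} 1_n/\sqrt{n(n-1)} \\ -1_n/\sqrt{n(n-1)} \end{bmatrix} = \frac{2}{n-1}\, u^T K u \quad (29)$$

where $u = \begin{bmatrix} 1_n/\sqrt{2n} \\ -1_n/\sqrt{2n} \end{bmatrix}$ is a unit vector and $K = \begin{bmatrix} K_{XX} & K_{XY} \\ K_{YX} & K_{YY} \end{bmatrix}$, with its submatrices defined as

$$K_{XX} := \left\{ \exp\left( -\frac{\|X_i - X_j\|^2}{\gamma^2} \right) I(i \ne j) \right\} = \begin{bmatrix} 0 & \exp\left( -\frac{\|X_1 - X_2\|^2}{\gamma^2} \right) & \cdots & \exp\left( -\frac{\|X_1 - X_n\|^2}{\gamma^2} \right) \\ \exp\left( -\frac{\|X_2 - X_1\|^2}{\gamma^2} \right) & 0 & \cdots & \exp\left( -\frac{\|X_2 - X_n\|^2}{\gamma^2} \right) \\ \vdots & & \ddots & \vdots \end{bmatrix}$$

and we use the first expression to summarize the above matrix and similarly,

$$K_{XY} = K_{YX}^T = \left\{ \exp\left( -\frac{\|X_i - Y_j\|^2}{\gamma^2} \right) I(i \ne j) \right\}.$$

Note that there are 0s on the diagonal of $K$, but also on the diagonals of the other two submatrices. Note
that $2\mathrm{Tr}(\Sigma) + \|\delta\|^2 = E\|X_i - Y_j\|^2 \approx E\|X_i - X_j\|^2 = E\|Y_i - Y_j\|^2 = 2\mathrm{Tr}(\Sigma)$ since $\|\delta\|^2 = o(\mathrm{Tr}(\Sigma))$
by Assumption [A4]. For $i \ne j$, let

$$\tau := 2\mathrm{Tr}(\Sigma)/\gamma^2 \approx E\|S_i - S_j\|^2/\gamma^2 = o(1) \quad (30)$$

for $S_i \in \{X_i, Y_i\}$. Let $a = \|S_i - S_j\|^2/\gamma^2$. Let us write the exact third order Taylor expansion of the terms
$\exp(-a)$ around $\exp(-\tau)$ as

$$e^{-a} = e^{-\tau} - e^{-\tau}(a - \tau) + \frac{e^{-\tau}}{2!}(a - \tau)^2 - \frac{e^{-\zeta}}{3!}(a - \tau)^3 \quad (31)$$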


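for some $\zeta$ between $a$ and $\tau$, and since $a, \tau > 0$, we have $\exp(-\zeta) \le 1$. For clarity in the following
expressions, we drop the $I(i \ne j)$ and assume it is understood. In this notation, the term-wise Taylor expansion
of $K$ is given by

$$K = e^{-\tau} \begin{bmatrix} \{1\} & \{1\} \\ \{1\} & \{1\} \end{bmatrix} - e^{-\tau}\, T_1 + \frac{e^{-\tau}}{2!}\, T_2 - \frac{1}{3!}\, E \circ T_3,$$

where, recalling that $\circ$ is the Hadamard product, $E$ is the matrix with entries $e^{-\zeta_{ij}}$ and, for $k = 1, 2, 3$,

$$T_k := \begin{bmatrix} \left\{ \left( \frac{\|X_i - X_j\|^2}{\gamma^2} - \tau \right)^k \right\} & \left\{ \left( \frac{\|X_i - Y_j\|^2}{\gamma^2} - \tau \right)^k \right\} \\ \left\{ \left( \frac{\|Y_i - X_j\|^2}{\gamma^2} - \tau \right)^k \right\} & \left\{ \left( \frac{\|Y_i - Y_j\|^2}{\gamma^2} - \tau \right)^k \right\} \end{bmatrix}.$$

Recalling Eq.(29) and expanding using the above Taylor expansion of $K$, we get

$$\mathrm{gMMD} = \frac{2e^{-\tau}}{\gamma^2}\, U_{CQ} + \frac{e^{-\tau}}{n-1}\, u^T T_2 u - \frac{2}{3!(n-1)}\, u^T (E \circ T_3) u.$$

Note that we have used the fact that for $u = \begin{bmatrix} 1_n/\sqrt{2n} \\ -1_n/\sqrt{2n} \end{bmatrix}$ we have

$$u^T \begin{bmatrix} \{1\} & \{1\} \\ \{1\} & \{1\} \end{bmatrix} u = 0$$

and also that

$$U_{CQ} = \frac{1}{2n(n-1)} \sum_{i \ne j} \left\{ -\|X_i - X_j\|^2 - \|Y_i - Y_j\|^2 + \|X_i - Y_j\|^2 + \|X_j - Y_i\|^2 \right\}.$$

Further, recall from Eq.(30) that $\tau = o(1)$.

The proof of the theorem will proceed from the above expansion in three steps. Define

$$U_4 := \frac{1}{n(n-1)} \sum_{i \ne j} h_4(X_i, X_j, Y_i, Y_j),$$

$$h_4(X_i, X_j, Y_i, Y_j) := \|X_i - X_j\|^4 + \|Y_i - Y_j\|^4 - \|X_i - Y_j\|^4 - \|X_j - Y_i\|^4$$

to note that

$$\frac{1}{n-1}\, u^T T_2 u = \frac{U_4}{2\gamma^4} + \frac{2\tau\, U_{CQ}}{\gamma^2}. \quad (32)$$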
(i) First we will show that the third order Taylor remainder term $R_3 := \frac{1}{n-1}\, u^T (E \circ T_3) u$ is a smaller
order term than $U_{CQ}/\gamma^2$.

(ii) Denote $\theta_2 = \frac{1}{n-1}\, u^T E[T_2] u$. We will show that $\theta_2 = o(\|\delta\|^2/\gamma^2)$.

(iii) Denote $S_4 = \mathrm{Var}(U_4)$. We will show that $\mathrm{Var}(U_4/\gamma^4) = o(\mathrm{Var}(U_{CQ}/\gamma^2))$.

Both $\theta_4 := E[U_4]$ and $S_4$ are tedious to calculate, especially under the alternative, and we will have to develop a
series of lemmas on the way to calculate these quantities. Assuming for the moment that these above claims
are true, we then have from Eq.(32) that

$$\mathrm{gMMD} = \frac{U_{CQ}}{\gamma^2} \left( 2e^{-\tau} + o_P(1) \right).$$

Since we have assumed $m \ge 8$ moments, this immediately implies convergence of means and variances, i.e.

$$E\,\mathrm{gMMD} = \frac{\|\delta\|^2}{\gamma^2} \left( 2e^{-\tau} + o(1) \right) \quad (33)$$

and

$$\mathrm{Var}(\mathrm{gMMD}) = \frac{\mathrm{Var}(U_{CQ})}{\gamma^4} \left( 2e^{-\tau} + o(1) \right)^2$$

which then implies that, ignoring smaller order terms,

$$\frac{\mathrm{gMMD} - E\,\mathrm{gMMD}}{\sqrt{\mathrm{Var}(\mathrm{gMMD})}} = \frac{U_{CQ} - \|\delta\|^2}{\sqrt{8\mathrm{Tr}(\Sigma^2)/n^2 + 8\,\delta^T\Sigma\delta/n}} \quad (34)$$

and hence the distribution of gMMD matches the distribution of $U_{CQ}$ under null and alternative (and the
above expression has a standard normal distribution), and the two statistics hence also have the same power.
The same argument also holds for the studentized statistics calculated in practice. The rest of the proof is
devoted to proving the three steps (i), (ii) and (iii).


Step (i): Bounding $R_3$


Noting that every element of is smaller than 1, and hence < ||i?oT 3 |j 2 < max^ E’ij IIT 3 II 2 < 

II +3 II 2 , implying that (ignoring constants) 


f?3 + 


|7Mk<^Mk 


Let us now bound every term of T 3 . Taking a union bound on the statement of Assumption [A3], we see that 
the same exponential concentration bound holds uniformly for all 0 {iR) = o{(R) pairs i,j, and hence w.p. 
tending to 1 , 


max 


11*5.-^,! 


7^ 


(we also multiplied both sides by d/ 7 ^). Hence we have w.p. tending to 1, 


f?3 + 


1 (U 

d^’'y/n 7 ® 


Since any random variable satisfies X = Op{^/Var{X)), we have that Ucq/^'^ = Op ^ 
the null (its variance is even larger under the alternate), and hence R 3 = op {Ucq/'j'^) whenever 


under 


1 (P _ ( 


7 ® 


i.e. 


717^ 


d^-Su 
This is reasonably satisfied whenever $\gamma^2 > \mathrm{Tr}(\Sigma) \asymp d$ and $n$ is polynomially related to $d$, as assumed. Hence, under
our assumptions $R_3 = o_P(U_{CQ}/\gamma^2)$.

Remark. We conjecture that this holds true under much weaker conditions on $\gamma, n, \Sigma, m$.


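Step (ii): The Behavior of $\theta_4 = E[U_4]$ and $\theta_2 = \frac{1}{n-1}\, u^T E[T_2] u$

Note the fact that for any random variable $V$, $E(V - b)^2 = \mathrm{Var}(V) + (EV - b)^2$. Using $V = \|X - Y\|^2/\gamma^2$,
$b = \tau$ and $EV = \tau + \|\delta\|^2/\gamma^2$, we can write the off-diagonal terms as

$$E[T_2] = \begin{bmatrix} \left\{ \frac{\mathrm{Var}(\|X - X'\|^2)}{\gamma^4} \right\} & \left\{ \frac{\mathrm{Var}(\|X - Y\|^2)}{\gamma^4} + \frac{\|\delta\|^4}{\gamma^4} \right\} \\ \left\{ \frac{\mathrm{Var}(\|X - Y\|^2)}{\gamma^4} + \frac{\|\delta\|^4}{\gamma^4} \right\} & \left\{ \frac{\mathrm{Var}(\|Y - Y'\|^2)}{\gamma^4} \right\} \end{bmatrix}.$$

Since $\mathrm{Var}(\|X - X'\|^2) = \mathrm{Var}(\|Y - Y'\|^2)$, we have

$$\theta_2 = \left[ \mathrm{Var}(\|X - X'\|^2) - \mathrm{Var}(\|X - Y\|^2) - \|\delta\|^4 \right] / \gamma^4.$$

The next two propositions imply that $\theta_2 = -8\,\delta^T\Sigma\delta/\gamma^4 - \|\delta\|^4/\gamma^4 = o(\|\delta\|^2/\gamma^2)$, as required for step (ii).
They also imply that

$$\theta_4 = -16\,\delta^T\Sigma\delta - 8\|\delta\|^2\, \mathrm{Tr}(\Sigma) - 2\|\delta\|^4 \asymp -\|\delta\|^2\, \mathrm{Tr}(\Sigma).$$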


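Proposition 4. Define $Z' = Z_1 - Z_2$ where $Z_1, Z_2$ are as in assumptions [A1], [A2]. Then

$$E(Z'^T \Sigma Z') = 2\mathrm{Tr}(\Sigma),$$

$$\mathrm{Var}(Z'^T \Sigma Z') \asymp \mathrm{Tr}(\Sigma^2),$$

$$E[(Z'^T \Sigma Z')^2] \asymp \mathrm{Tr}^2(\Sigma).$$

Proof. Since $Z_1, Z_2$ are independent, with zero mean and identity covariance, $Z'$ has mean zero,
covariance $2I$ and fourth moment $EZ'^4_k = E(Z_{1k} - Z_{2k})^4 = 3 + \Delta_4 + 6 + 3 + \Delta_4 = 12 + 2\Delta_4$. Firstly

$$E[Z'^T \Sigma Z'] = E\,\mathrm{Tr}(Z'^T \Sigma Z') = \mathrm{Tr}\,E(\Sigma Z' Z'^T) = \mathrm{Tr}(\Sigma\, E[Z' Z'^T]) = 2\mathrm{Tr}(\Sigma)$$

where the last step follows since $E[Z' Z'^T] = 2I$. Next,

$$\mathrm{Var}(Z'^T \Sigma Z') = E[Z'^T \Sigma Z']^2 - [2\mathrm{Tr}(\Sigma)]^2 = E \sum_{i,j,k,l} \Sigma_{ij} \Sigma_{kl} Z'_i Z'_j Z'_k Z'_l - 4\Big( \sum_i \Sigma_{ii} \Big)^2$$

$$= 4 \sum_i \sum_{j \ne i} \Sigma_{ii} \Sigma_{jj} + 8 \sum_i \sum_{j \ne i} \Sigma_{ij}^2 + (12 + 2\Delta_4) \sum_i \Sigma_{ii}^2 - 4\Big( \sum_i \Sigma_{ii}^2 + \sum_i \sum_{j \ne i} \Sigma_{ii} \Sigma_{jj} \Big)$$

$$= 8\mathrm{Tr}(\Sigma^2) + 2\Delta_4\, \mathrm{Tr}(\Sigma \circ \Sigma)$$

where the third step follows because the only nonzero terms in $\sum_{i,j,k,l}$ arise when (a) $i = j$ and $k = l \ne i$,
or (b) $i = k$ and $j = l \ne i$, or (c) $i = l$ and $j = k \ne i$, or (d) $i = j = k = l$, and the last step follows because
$\mathrm{Tr}(\Sigma^2) = \|\Sigma\|_F^2 = \sum_{i,j} \Sigma_{ij}^2$. The lemma is proved because $\sum_i \Sigma_{ii}^2 \le \sum_{i,j} \Sigma_{ij}^2$.

Hence

$$E[(Z'^T \Sigma Z')^2] = \mathrm{Var}(Z'^T \Sigma Z') + (E Z'^T \Sigma Z')^2 = 8\mathrm{Tr}(\Sigma^2) + 2\Delta_4 \sum_i \Sigma_{ii}^2 + 4\mathrm{Tr}^2(\Sigma) \asymp \mathrm{Tr}^2(\Sigma).$$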
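Proposition 5. Let $X, Y$ be as in assumptions [A1], [A2], [A3]. Then

$$E\|X - Y\|^2 = 2\mathrm{Tr}(\Sigma) + \|\delta\|^2,$$

$$\mathrm{Var}(\|X - Y\|^2) \asymp 8\mathrm{Tr}(\Sigma^2) + 8\,\delta^T\Sigma\delta,$$

$$E\|X - Y\|^4 \asymp 4\mathrm{Tr}^2(\Sigma) + 4\|\delta\|^2\, \mathrm{Tr}(\Sigma).$$

Proof. Remember that $X - Y = \Gamma(Z_1 - Z_2) + \delta =: \Gamma Z' + \delta$. Note that $Z'$ has zero mean, variance $2I$ and
every component is independent with third moment zero. Write $\Pi := \Gamma^T\Gamma$, so that $\mathrm{Tr}(\Pi) = \mathrm{Tr}(\Sigma)$ and $\mathrm{Tr}(\Pi^2) = \mathrm{Tr}(\Sigma^2)$. Hence

$$E\|X - Y\|^2 = E\|\Gamma Z' + \delta\|^2 = E[Z'^T \Pi Z'] + \|\delta\|^2 + 2E[\delta^T \Gamma Z'] = 2\mathrm{Tr}(\Sigma) + \|\delta\|^2.$$

Hence

$$\mathrm{Var}\|X - Y\|^2 = E[\|\Gamma Z' + \delta\|^2 - (2\mathrm{Tr}(\Sigma) + \|\delta\|^2)]^2 = E[Z'^T \Pi Z' + 2\delta^T \Gamma Z' - 2\mathrm{Tr}(\Sigma)]^2$$

$$= \mathrm{Var}(Z'^T \Pi Z') + 4E[\delta^T \Gamma Z' Z'^T \Gamma^T \delta] + 4E[(Z'^T \Pi Z' - 2\mathrm{Tr}(\Sigma))\, \delta^T \Gamma Z']$$

$$= 8\mathrm{Tr}(\Sigma^2) + 2\Delta_4\, \mathrm{Tr}(\Pi \circ \Pi) + 8\,\delta^T\Sigma\delta + 4E\Big[ \sum_{i,j} \Pi_{ij} Z'_i Z'_j Z'^T \Big] \Gamma^T \delta$$

$$= 8\mathrm{Tr}(\Sigma^2) + 2\Delta_4\, \mathrm{Tr}(\Pi \circ \Pi) + 8\,\delta^T\Sigma\delta.$$

The second last step follows since $E[\sum_{i,j} \Pi_{ij} Z'_i Z'_j Z'_k] = 0$, because $Z'$ has first and third moments 0.

Hence

$$E\|X - Y\|^4 = \mathrm{Var}(\|X - Y\|^2) + (E\|X - Y\|^2)^2 = 8\mathrm{Tr}(\Sigma^2) + 2\Delta_4\, \mathrm{Tr}(\Pi \circ \Pi) + 8\,\delta^T\Sigma\delta + 4\mathrm{Tr}^2(\Sigma) + 4\|\delta\|^2\, \mathrm{Tr}(\Sigma) + \|\delta\|^4.$$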

Step (iii): The Behavior of $S_4 = \mathrm{Var}(U_4)$

We use the variance formula based on the Hoeffding decomposition of the U-statistic $U_4$. We ignore
constants since we only aim to show that $\mathrm{Var}(U_4/\gamma^4)$ is dominated by (is an order of magnitude smaller than)
$\mathrm{Var}(U_{CQ}/\gamma^2)$. Hence, we have by Lemma A of Section 5.2.1 of Serfling [2009],

$$\mathrm{Var}(U_4) \asymp \frac{\mathrm{Var}(h_4)}{n^2} + \frac{\mathrm{Var}(E[h_4 | X, Y])}{n}. \quad (35)$$

Some tedious algebra is required to estimate the second term. Recall that

$$U_4 = \frac{1}{n(n-1)} \sum_{i \ne j} h_4(X_i, X_j, Y_i, Y_j),$$

$$h_4(X_i, X_j, Y_i, Y_j) = \|X_i - X_j\|^4 + \|Y_i - Y_j\|^4 - \|X_i - Y_j\|^4 - \|X_j - Y_i\|^4,$$

$$\theta_4 = E\|X_i - X_j\|^4 + E\|Y_i - Y_j\|^4 - E\|X_i - Y_j\|^4 - E\|X_j - Y_i\|^4,$$

where $X, X' \sim P$ and $Y, Y' \sim Q$ from the model in [A1, A2] given by $X = \Gamma Z_1$ and $Y = \Gamma Z_2 + \delta$ (since
$h_4$ depends only on differences, we have assumed $\delta_1 = 0$ and $\delta_2 = \delta$ without loss of generality). Firstly,
it is easy to verify that $h_4$ yields a degenerate U-statistic under the null, since $E[h_4 | (X, Y)] = 0$ when $P = Q$.
We will now derive the variance of $E[h_4 | (X, Y)]$ when $P \ne Q$ under our assumptions. Let us first derive
$E[h_4 | (X, Y)]$ below. For convenience of notation, denote

$$Y = \Gamma Z_Y$$

where $Z_Y = Z_2 + \eta$ and $\Gamma\eta = \delta$. Then


||x-r'||4 

E[||x-y'ri(y,y)] 

||x'-y||4 

E[||x'-yri(x,y)] 


{x'^x + y^Y' - 2X^Y'f = {x^xf + + 4(x'^y')^ 

+2X^XY''^Y' - YY''^Y'X^Y' - AX'^XX'^Y’, 

{X'^Xf + V\{Z'J ^Z'yf] + 4X^(E + 55'^)X + 2X^X{Tr(Y) + Pf) 
-4E[z^ nz!yZ!^ r^]rzi - ax'^xx^s, 

{X'’^X' + Y^Y - 2X''^Y f = {X''^X' f + {Y'^Yf + A{X'^YY 

+ 2 x'^x'y^r - 4y^yx'^y - ax''^x'x'^y, 

E[{Z[^UZ[f] + (Y'^Yf + 4y'^sr + 2Y^YTriY) 
-AE[Z[^nZ[Z[^y]{TZ2 + 6). 


Denoting Oy := EIZyUZyZy], we have 


UYk = E[(y^ UijZYiZyj + Y.^uZl,)ZYk 

i=jtj i 


= E 


En. i{Z2iZ2j + 'rijZ2i + TiiZ2j + r]i'r]j){Z2k + Vk) 


+E 


E! + ‘^Z2irji + rjf){Z2k + Vk) 


0 + 0 + E! ^kjVj + 0 + E! ^ikVi + 0 + 0 + Pfe E^ Vi^ijVj 

i^k i^k 

+ Aallfefc + rjk En. i + ZYlkkVk + 0 + 0 + 77fc 

+ 7?fc(En?7) 


2 E 

+ 

Aallfcfe + pfcTr(n) + 2nfefcpfc 

j¥^k 


- 


= Aallfcfe + ?7fcTr(n) + 211^77 + pfcPp. 


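Since $\Pi \eta = \Gamma^T \Gamma \eta = \Gamma^T \delta$, we have $\alpha_Y = \lambda_3\, \mathrm{diag}(\Pi) + \eta\, \mathrm{Tr}(\Pi) + 2 \Gamma^T \delta + \eta\, \eta^T \Pi \eta$, where $\eta^T \Pi \eta = \|\delta\|^2$. Using this and calling $\alpha_X := \mathbb{E}[Z_1^T \Pi Z_1 Z_1] = \lambda_3\, \mathrm{diag}(\Pi)$,

$$\begin{aligned}
-\mathbb{E}[\|X - Y'\|^4 \mid (X, Y)] &= -(X^T X)^2 - \mathbb{E}[(Z_Y'^T \Pi Z_Y')^2] - 4 X^T \Sigma X - 4 X^T \delta\delta^T X - 2 X^T X\, \mathrm{Tr}(\Sigma) \\
&\quad - 2 X^T X \|\delta\|^2 + 4 \alpha_X^T \Gamma^T X + 4 \mathrm{Tr}(\Sigma) \delta^T X + 8 \delta^T \Sigma X + 4 \|\delta\|^2 \delta^T X + 4 X^T X\, X^T \delta, \\
-\mathbb{E}[\|X' - Y\|^4 \mid (X, Y)] &= -\mathbb{E}[(Z_1'^T \Pi Z_1')^2] - (Y^T Y)^2 - 4 Y^T \Sigma Y - 2 Y^T Y\, \mathrm{Tr}(\Sigma) + 4 \alpha_X^T \Gamma^T Y, \\
\mathbb{E}[\|Y - Y'\|^4 \mid (X, Y)] &= (Y^T Y)^2 + \mathbb{E}[(Z_Y'^T \Pi Z_Y')^2] + 4 Y^T \Sigma Y + 4 Y^T \delta\delta^T Y + 2 Y^T Y\, \mathrm{Tr}(\Sigma) + 2 Y^T Y \|\delta\|^2 \\
&\quad - 4 \alpha_X^T \Gamma^T Y - 4 \mathrm{Tr}(\Sigma) \delta^T Y - 8 \delta^T \Sigma Y - 4 \|\delta\|^2 \delta^T Y - 4 Y^T Y\, Y^T \delta, \\
\mathbb{E}[\|X - X'\|^4 \mid (X, Y)] &= \mathbb{E}[(Z_1'^T \Pi Z_1')^2] + (X^T X)^2 + 4 X^T \Sigma X + 2 X^T X\, \mathrm{Tr}(\Sigma) - 4 \alpha_X^T \Gamma^T X.
\end{aligned}$$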


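Adding the above 4 equations, we get

$$\begin{aligned}
\mathbb{E}[h_4 \mid (X, Y)] &= 4 \delta^T (Y Y^T - X X^T) \delta + 2 (Y^T Y - X^T X) \|\delta\|^2 - 4 \mathrm{Tr}(\Pi) \delta^T (Y - X) \\
&\quad - 8 \delta^T \Sigma (Y - X) - 4 \|\delta\|^2 \delta^T (Y - X) - 4 (Y^T Y\, Y^T - X^T X\, X^T) \delta.
\end{aligned} \tag{36}$$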
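As a sanity check on the algebra leading to Eq. (36), the following minimal sketch compares the closed form with a direct numerical evaluation of the conditional expectation. It assumes Gaussian $Z_1, Z_2$ (so $\lambda_3 = 0$) and a small dimension; the helper names `gauss_expect`, `cond_mean_h4_formula` and `cond_mean_h4_quadrature` are hypothetical and not part of the original text. Gauss-Hermite quadrature is exact here because the integrand is a polynomial of degree 4 in each coordinate.

```python
import itertools
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def gauss_expect(f, dim, n_nodes=4):
    """E[f(Z)] for Z ~ N(0, I_dim) via tensor-product Gauss-Hermite quadrature.

    Exact up to rounding when f is a polynomial of degree <= 2*n_nodes - 1
    in each coordinate.
    """
    x, w = hermegauss(n_nodes)        # nodes/weights for weight exp(-x^2 / 2)
    w = w / np.sqrt(2.0 * np.pi)      # normalise to the N(0, 1) density
    total = 0.0
    for idx in itertools.product(range(n_nodes), repeat=dim):
        idx = list(idx)
        total = total + np.prod(w[idx]) * f(x[idx])
    return total


def cond_mean_h4_formula(X, Y, Sigma, delta):
    """Right-hand side of Eq. (36); note Tr(Pi) = Tr(Sigma)."""
    d2 = delta @ delta
    return (4 * (delta @ Y) ** 2 - 4 * (delta @ X) ** 2
            + 2 * (Y @ Y - X @ X) * d2
            - 4 * np.trace(Sigma) * (delta @ (Y - X))
            - 8 * (delta @ Sigma @ (Y - X))
            - 4 * d2 * (delta @ (Y - X))
            - 4 * ((Y @ Y) * (Y @ delta) - (X @ X) * (X @ delta)))


def cond_mean_h4_quadrature(X, Y, Gamma, delta):
    """E[h_4 | (X, Y)], integrating over X' = Gamma Z1' and Y' = Gamma Z2' + delta."""
    d = len(delta)

    def h4(z):
        Xp = Gamma @ z[:d]
        Yp = Gamma @ z[d:] + delta
        n4 = lambda v: (v @ v) ** 2   # fourth power of the Euclidean norm
        return n4(X - Xp) + n4(Y - Yp) - n4(X - Yp) - n4(Xp - Y)

    return gauss_expect(h4, 2 * d)


rng = np.random.default_rng(0)
Gamma = rng.standard_normal((2, 2))
delta, X, Y = rng.standard_normal((3, 2))
print(cond_mean_h4_quadrature(X, Y, Gamma, delta),
      cond_mean_h4_formula(X, Y, Gamma @ Gamma.T, delta))
```

The two printed numbers should agree up to floating-point error for any fixed $X$, $Y$, $\Gamma$ and $\delta$, since under Gaussianity Eq. (36) is an exact identity.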


We will now take a detour to calculate the expectations and variances of products of quadratic forms, to aid us in bounding $\mathrm{Var}(\mathbb{E}[h_4 \mid (X, Y)])$ by bounding the variances of each term in Eq. (36) above.

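Proposition 6. Let $Q := \epsilon^T \Pi \epsilon$ be a quadratic form, where $\epsilon$ is standard normal. Then

$$\begin{aligned}
\mathbb{E}[Q] &= \mathrm{Tr}(\Pi), \\
\mathbb{E}[Q^2] &= \mathrm{Tr}^2(\Pi) + 2 \mathrm{Tr}(\Pi^2), \\
\mathrm{Var}(Q) &= 2 \mathrm{Tr}(\Pi^2), \\
\mathbb{E}[Q^3] &= \mathrm{Tr}^3(\Pi) + 6 \mathrm{Tr}(\Pi^2) \mathrm{Tr}(\Pi) + 8 \mathrm{Tr}(\Pi^3), \\
\mathbb{E}[Q^4] &= \mathrm{Tr}^4(\Pi) + 12 \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi) + 12 \mathrm{Tr}^2(\Pi^2) + 32 \mathrm{Tr}(\Pi) \mathrm{Tr}(\Pi^3) + 48 \mathrm{Tr}(\Pi^4), \\
\mathrm{Var}(Q^2) &= \mathrm{Tr}^4(\Pi) + 12 \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi) + 12 \mathrm{Tr}^2(\Pi^2) + 32 \mathrm{Tr}(\Pi) \mathrm{Tr}(\Pi^3) + 48 \mathrm{Tr}(\Pi^4) \\
&\quad - \big(\mathrm{Tr}^4(\Pi) + 4 \mathrm{Tr}^2(\Pi^2) + 4 \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi)\big) \\
&\le 96\, \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi).
\end{aligned}$$

Proof. The expectations follow directly from the results of Magnus [1979] and Kendall and Stuart [1977]. The last inequality follows because $\mathrm{Tr}(AB) \le \mathrm{Tr}(A) \mathrm{Tr}(B)$ for any two psd matrices (by Cauchy-Schwarz), so that $\mathrm{Tr}(\Pi^2) \le \mathrm{Tr}^2(\Pi)$, $\mathrm{Tr}(\Pi^3) \le \mathrm{Tr}(\Pi^2) \mathrm{Tr}(\Pi)$ and $\mathrm{Tr}(\Pi^4) \le \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi)$.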
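The moment formulas of Proposition 6 can be cross-checked against the standard cumulants of a Gaussian quadratic form, $\kappa_r = 2^{r-1} (r-1)!\, \mathrm{Tr}(\Pi^r)$. The sketch below is illustrative only; the function names are hypothetical, and $\Pi$ is assumed symmetric psd.

```python
import numpy as np


def trace_powers(Pi):
    """Return Tr(Pi), Tr(Pi^2), Tr(Pi^3), Tr(Pi^4)."""
    return [float(np.trace(np.linalg.matrix_power(Pi, r))) for r in range(1, 5)]


def moments_via_cumulants(Pi):
    """Raw moments E[Q^r], r = 1..4, from kappa_r = 2^(r-1) (r-1)! Tr(Pi^r)."""
    t1, t2, t3, t4 = trace_powers(Pi)
    k1, k2, k3, k4 = t1, 2 * t2, 8 * t3, 48 * t4
    m1 = k1
    m2 = k2 + k1 ** 2
    m3 = k3 + 3 * k2 * k1 + k1 ** 3
    m4 = k4 + 4 * k3 * k1 + 3 * k2 ** 2 + 6 * k2 * k1 ** 2 + k1 ** 4
    return m1, m2, m3, m4


def moments_proposition6(Pi):
    """The four raw moments as stated in Proposition 6."""
    t1, t2, t3, t4 = trace_powers(Pi)
    return (t1,
            t1 ** 2 + 2 * t2,
            t1 ** 3 + 6 * t2 * t1 + 8 * t3,
            t1 ** 4 + 12 * t2 * t1 ** 2 + 12 * t2 ** 2 + 32 * t1 * t3 + 48 * t4)


def var_q2_and_bound(Pi):
    """Var(Q^2) = E[Q^4] - E[Q^2]^2 together with the bound 96 Tr(Pi^2) Tr^2(Pi)."""
    t1, t2, _, _ = trace_powers(Pi)
    _, m2, _, m4 = moments_proposition6(Pi)
    return m4 - m2 ** 2, 96 * t2 * t1 ** 2


rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
Pi = G.T @ G                      # an arbitrary psd matrix for illustration
print(moments_via_cumulants(Pi))
print(moments_proposition6(Pi))
print(var_q2_and_bound(Pi))
```

Both routes should give the same four moments, and the exact value of $\mathrm{Var}(Q^2)$ should sit below the stated bound.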


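Proposition 7. Let $\mathrm{Ts}(A) = \sum_{ij} A_{ij}$ denote the total sum of all entries of $A$ and let $\circ$ denote the Hadamard product. Let $Q = \epsilon^T \Pi \epsilon$, where the moments of the coordinates of $\epsilon$ are given by

$$\begin{aligned}
m_1 &= 0, \\
m_2 &= 1, \\
m_3 &= \lambda_3, \\
m_4 &= 3 + \lambda_4, \\
m_5 &= \lambda_5 + 10 \lambda_3, \\
m_6 &= \lambda_6 + 15 \lambda_4 + 10 \lambda_3^2 + 15, \\
m_7 &= \lambda_7 + 21 \lambda_5 + 35 \lambda_4 \lambda_3 + 105 \lambda_3, \\
m_8 &= \lambda_8 + 28 \lambda_6 + 56 \lambda_5 \lambda_3 + 35 \lambda_4^2 + 210 \lambda_4 + 280 \lambda_3^2 + 105.
\end{aligned}$$

Here the $\lambda$s should be thought of as deviations from normality: $\lambda_3$ is the skewness and $\lambda_4$ is the excess kurtosis, and $\lambda_i = 0$ for all $i \ge 3$ if $\epsilon$ were standard Gaussian. Then, we have

$$\begin{aligned}
\mathbb{E}[Q] &= \mathrm{Tr}(\Pi), \\
\mathrm{Var}[Q] &= 2 \mathrm{Tr}(\Pi^2) + \lambda_4 \mathrm{Tr}(\Pi \circ \Pi), \\
\mathbb{E}[Q^2] &= 2 \mathrm{Tr}(\Pi^2) + \lambda_4 \mathrm{Tr}(\Pi \circ \Pi) + \mathrm{Tr}^2(\Pi), \\
\mathbb{E}[Q^4] &= \mathrm{Tr}^4(\Pi) + 12 \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi) + 12 \mathrm{Tr}^2(\Pi^2) + 32 \mathrm{Tr}(\Pi) \mathrm{Tr}(\Pi^3) + 48 \mathrm{Tr}(\Pi^4) \\
&\quad + \lambda_4 f_4 + \lambda_6 f_6 + \lambda_8 f_8 + \lambda_3^2 f_{3^2} + \lambda_4^2 f_{4^2} + \lambda_3 \lambda_5 f_{35},
\end{aligned}$$

where

$$\begin{aligned}
f_4 &= 6 \mathrm{Tr}^2(\Pi) \mathrm{Tr}(\Pi \circ \Pi) + 12 \mathrm{Tr}(\Pi^2) \mathrm{Tr}(\Pi \circ \Pi) + 48 \mathrm{Tr}(\Pi) \mathrm{Tr}(\Pi \circ \Pi^2) \\
&\quad + 96 \mathrm{Tr}(\mathrm{diag}(\Pi) \Pi^3) + 48 \mathrm{Tr}(\mathrm{diag}^2(\Pi^2)), \\
f_6 &= 4 \mathrm{Tr}(\Pi) \mathrm{Tr}(\Pi \circ \Pi \circ \Pi) + 24 \mathrm{Tr}(\Pi \circ \Pi \circ \Pi^2), \\
f_8 &= \mathrm{Tr}(\Pi \circ \Pi \circ \Pi \circ \Pi), \\
f_{3^2} &= 24 \mathrm{Ts}(\mathrm{diag}(\Pi) \Pi\, \mathrm{diag}(\Pi)) \mathrm{Tr}(\Pi) + 48 \mathrm{Ts}(\mathrm{diag}(\Pi) \Pi^2 \mathrm{diag}(\Pi)) + 16 \mathrm{Ts}(\Pi \circ \Pi \circ \Pi) \mathrm{Tr}(\Pi) \\
&\quad + 96 \mathrm{Ts}((\Pi \circ \Pi) \Pi\, \mathrm{diag}(\Pi)) + 96 \mathrm{Tr}(\Pi (\Pi \circ \Pi) \Pi), \\
f_{4^2} &= 3 \mathrm{Tr}^2(\Pi \circ \Pi) + 24 \mathrm{Ts}(\mathrm{diag}(\Pi) (\Pi \circ \Pi) \mathrm{diag}(\Pi)) + 8 \mathrm{Ts}(\Pi \circ \Pi \circ \Pi \circ \Pi), \\
f_{35} &= 24 \mathrm{Ts}(\mathrm{diag}(\Pi) \Pi\, \mathrm{diag}^2(\Pi)) + 32 \mathrm{Ts}(\mathrm{diag}(\Pi) (\Pi \circ \Pi \circ \Pi)),
\end{aligned}$$

and

$$\mathrm{Var}(Q^2) \asymp \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi).$$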


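Proof. The first four claims follow directly from the detailed work of Bao and Ullah [2010]. Let us see how the last claim then follows. First note that $\mathrm{Tr}(\Pi \circ \Pi) \le \mathrm{Tr}(\Pi^2) \le \mathrm{Tr}^2(\Pi)$. The first inequality follows because $\sum_i \Pi_{ii}^2 \le \sum_{ij} \Pi_{ij}^2 = \|\Pi\|_F^2 = \mathrm{Tr}(\Pi^2)$. The second follows because $0 \le \mathrm{Tr}(\Pi^2) = \langle \Pi, \Pi \rangle \le \|\Pi\|_{op} \|\Pi\|_* \le \mathrm{Tr}^2(\Pi)$ by Cauchy-Schwarz. We also use the Hadamard product identity $\mathrm{diag}(\Pi) (\Pi \circ \Pi) \mathrm{diag}(\Pi) = (\mathrm{diag}(\Pi) \Pi) \circ (\Pi\, \mathrm{diag}(\Pi)) = (\Pi\, \mathrm{diag}(\Pi)) \circ (\mathrm{diag}(\Pi) \Pi) = \Pi \circ (\mathrm{diag}(\Pi) \Pi\, \mathrm{diag}(\Pi))$, see Horn and Johnson [1991]. Since $\mathrm{Tr}(AB) \le \mathrm{Tr}(A) \mathrm{Tr}(B)$ for any two psd matrices, we similarly have

$$\begin{aligned}
\mathrm{Ts}(\Pi \circ \Pi \circ \Pi) &= \sum_{ij} \Pi_{ij}^3 \le \sum_{ij} |\Pi_{ij}|^3 \le \mathrm{Tr}(\Pi^2) \mathrm{Tr}(\Pi), \\
\mathrm{Tr}(\Pi \circ \Pi \circ \Pi) &= \sum_i \Pi_{ii}^3 \le \sum_i |\Pi_{ii}|^3 \le \sum_{ij} |\Pi_{ij}|^3 \le \mathrm{Tr}(\Pi^2) \mathrm{Tr}(\Pi), \\
\mathrm{Ts}(\Pi \circ \Pi \circ \Pi \circ \Pi) &= \sum_{ij} \Pi_{ij}^4 = \langle \Pi \circ \Pi, \Pi \circ \Pi \rangle \le \mathrm{Tr}^2(\Pi \circ \Pi) \le \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi), \\
\mathrm{Tr}(\Pi \circ \Pi \circ \Pi \circ \Pi) &\le \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi), \\
\mathrm{Tr}(\mathrm{diag}(\Pi) \Pi^3) &\le \mathrm{Tr}(\mathrm{diag}(\Pi)) \mathrm{Tr}(\Pi^3) \le \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi), \\
\mathrm{Tr}(\Pi (\Pi \circ \Pi) \Pi) &\le \mathrm{Tr}(\Pi) \mathrm{Tr}(\Pi \circ \Pi) \mathrm{Tr}(\Pi) \le \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi), \\
\mathrm{Ts}(\mathrm{diag}(\Pi) (\Pi \circ \Pi) \mathrm{diag}(\Pi)) &\le \mathrm{Tr}^2(\Pi) \mathrm{Tr}(\Pi^2).
\end{aligned}$$

In this fashion, we can verify that the dominant term of $\mathrm{Var}(Q^2)$ scales as $\mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi)$.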
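To make the role of $\lambda_4$ concrete, the variance formula $\mathrm{Var}[Q] = 2 \mathrm{Tr}(\Pi^2) + \lambda_4 \mathrm{Tr}(\Pi \circ \Pi)$ can be checked exactly for a simple non-Gaussian example. The sketch below assumes Rademacher coordinates ($\pm 1$ with equal probability), for which $m_2 = 1$, $m_3 = 0$ and $m_4 = 1$, hence $\lambda_4 = -2$; this choice of distribution and the function names are illustrative assumptions, not part of the original argument. All $2^d$ sign vectors are enumerated, so no sampling error is involved.

```python
import itertools
import numpy as np


def exact_rademacher_moments(Pi):
    """Exact mean and variance of Q = eps' Pi eps over all 2^d sign vectors."""
    d = Pi.shape[0]
    vals = np.array([np.array(s) @ Pi @ np.array(s)
                     for s in itertools.product([-1.0, 1.0], repeat=d)])
    return vals.mean(), vals.var()   # population variance (equally likely vectors)


def var_formula(Pi, lam4):
    """2 Tr(Pi^2) + lambda_4 Tr(Pi o Pi), where Tr(Pi o Pi) = sum_i Pi_ii^2."""
    return 2 * np.trace(Pi @ Pi) + lam4 * np.sum(np.diag(Pi) ** 2)


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
Pi = G.T @ G
mean_q, var_q = exact_rademacher_moments(Pi)
print(mean_q, np.trace(Pi))
print(var_q, var_formula(Pi, lam4=-2.0))
# The chain Tr(Pi o Pi) <= Tr(Pi^2) <= Tr^2(Pi) used in the proof:
print(np.sum(np.diag(Pi) ** 2), np.trace(Pi @ Pi), np.trace(Pi) ** 2)
```

With Rademacher coordinates the diagonal of $\Pi$ contributes no variance at all, which is exactly what the $\lambda_4 = -2$ correction removes from the Gaussian value $2 \mathrm{Tr}(\Pi^2)$.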



We can now extend these results to the case where the quadratic form is uncentered. 

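Proposition 8. Let $Q = \epsilon^T \Pi \epsilon$ and $Q' = Q + a^T \epsilon + b$, where $\epsilon$ satisfies the conditions of the previous proposition, $a^T a = 4 \delta^T \Sigma \delta$ and $b = \delta^T \delta$. Then

$$\begin{aligned}
\mathbb{E}[Q'] &= \mathrm{Tr}(\Pi) + b, \\
Q'^2 &= Q^2 + (a^T \epsilon)^2 + b^2 + 2 Q a^T \epsilon + 2 b a^T \epsilon + 2 b Q, \\
\mathbb{E}[Q'^2] &\asymp \mathrm{Tr}^2(\Pi) + 2 \mathrm{Tr}(\Pi^2) + a^T a + b^2 + 2 \lambda_3\, \mathrm{diag}(\Pi)^T a + 2 b \mathrm{Tr}(\Pi), \\
\mathrm{Var}(Q') &\asymp 2 \mathrm{Tr}(\Pi^2) + a^T a + 2 \lambda_3\, \mathrm{diag}(\Pi)^T a, \\
\mathrm{Var}(Q'^2) &\le 2 \mathrm{Var}(Q^2) + 4 (a^T a)^2 + 2 \lambda_4 \mathrm{Tr}(a a^T \circ a a^T) + 4 \mathrm{Var}(Q a^T \epsilon) \\
&\quad + 4 b^2 a^T a + 8 b^2 \mathrm{Tr}(\Pi^2) + 4 b^2 \lambda_4 \mathrm{Tr}(\Pi \circ \Pi) \\
&\asymp \mathrm{Tr}^2(\Pi) \mathrm{Tr}(\Pi^2) \\
&\asymp \mathrm{Var}(Q^2).
\end{aligned}$$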


Proof. All statements hold simply by expansion and substitution from the previous proposition. Remembering that $\mathrm{Var}(Q^2) \asymp \mathrm{Tr}(\Pi^2) \mathrm{Tr}^2(\Pi)$, we can see that the last claim holds. Indeed, Assumption [A4] implies that $a^T a = o(\lambda_{\max}(\Sigma) \mathrm{Tr}(\Sigma))$ and hence $(a^T a)^2 = o(\mathrm{Tr}(\Sigma^2) \mathrm{Tr}^2(\Sigma))$ since $\lambda_{\max}^2(\Sigma) \le \mathrm{Tr}(\Sigma^2)$.

Similarly, $b^2 a^T a = o(\mathrm{Tr}^2(\Sigma) \mathrm{Tr}(\Sigma^2))$. In this fashion we deduce that the dominant term in $\mathrm{Var}(Q'^2)$ is $\mathrm{Var}(Q^2)$.

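Since $\mathrm{Var}(A + B) \le 2 \mathrm{Var}(A) + 2 \mathrm{Var}(B)$ and $(a + b + c)^2 \le 3 a^2 + 3 b^2 + 3 c^2$, we can alternately derive the following bounds for variances of quadratic forms involving $Y = \Gamma Z_2 + \delta$:

$$\begin{aligned}
Y^T Y &= Z_2^T \Pi Z_2 + \delta^T \delta + 2 \delta^T \Gamma Z_2, \\
Y^T \Sigma Y &= Z_2^T \Pi^2 Z_2 + \delta^T \Sigma \delta + 2 \delta^T \Sigma \Gamma Z_2, \\
(Y^T Y)^2 &\le 3 (Z_2^T \Pi Z_2)^2 + 3 (\delta^T \delta)^2 + 3 (2 \delta^T \Gamma Z_2)^2, \\
\mathbb{E}[Y^T Y] &= \mathrm{Tr}(\Sigma) + \delta^T \delta, \\
\mathbb{E}[Y^T \Sigma Y] &= \mathrm{Tr}(\Sigma^2) + \delta^T \Sigma \delta, \\
\mathrm{Var}(Y^T Y) &\le 4 \mathrm{Tr}(\Sigma^2) + 8 \delta^T \Sigma \delta, \\
\mathbb{E}[(Y^T Y)^2] &= \mathrm{Var}(Y^T Y) + \mathbb{E}^2(Y^T Y) \\
&\le 4 \mathrm{Tr}(\Sigma^2) + 8 \delta^T \Sigma \delta + (\mathrm{Tr}(\Sigma) + \delta^T \delta)^2 \asymp \mathrm{Tr}^2(\Sigma), \\
\mathrm{Var}(Y^T \Sigma Y) &\le 4 \mathrm{Tr}(\Sigma^4) + 8 \delta^T \Sigma^3 \delta, \\
\mathrm{Var}((Y^T Y)^2) &\le 18 \mathrm{Var}((Z_2^T \Pi Z_2)^2) + 18 \mathrm{Var}((\delta^T \Gamma Z_2)^2) \\
&\asymp \mathrm{Tr}(\Sigma^2) \mathrm{Tr}^2(\Sigma) + (\delta^T \Sigma \delta)^2,
\end{aligned}$$

where we used $\mathrm{Var}((v^T Z)^2) = \mathrm{Var}(Z^T v v^T Z) = 2 \mathrm{Tr}((v v^T)^2) = 2 (v^T v)^2$. Since $\delta^T \Sigma \delta = o(\mathrm{Tr}(\Sigma^2))$ by our assumptions, the last expression is dominated by its first term.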



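Proposition 9.

$$\begin{aligned}
\mathrm{Var}(X^T X X^T \delta) &\asymp \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta, \\
\mathrm{Var}(Y^T Y Y^T \delta) &\asymp \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta \asymp \mathrm{Var}(X^T X X^T \delta).
\end{aligned}$$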

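Proof. Let us first calculate $\mathrm{Var}(X^T X X^T \delta)$, for which we need to know $\mathbb{E}[X X^T X X^T X X^T]$. Let us first calculate $\mathbb{E}[X X^T X X^T]$. For this purpose, see that $\mathbb{E}(Z_1 Z_1^T \Pi Z_1 Z_1^T) = \mathbb{E}((Z_1^T \Pi Z_1) Z_1 Z_1^T) = 2 \Pi + \mathrm{Tr}(\Pi) I$. This is true because its off-diagonal element is $\mathbb{E}(\sum_{ij} \Pi_{ij} Z_i Z_j Z_a Z_b) = 2 \Pi_{ab}$, and its diagonal element is $\mathbb{E}(\sum_{ij} \Pi_{ij} Z_i Z_j Z_a^2) = \mathrm{Tr}(\Pi) + 2 \Pi_{aa}$. Hence $\mathbb{E}(X X^T X X^T) = \Gamma\, \mathbb{E}(Z_1 Z_1^T \Pi Z_1 Z_1^T)\, \Gamma^T = 2 \Sigma^2 + \mathrm{Tr}(\Sigma) \Sigma$. Now, we are ready to calculate $\mathbb{E}[X X^T X X^T X X^T]$.

Define $C := \mathbb{E}((Z_1^T \Pi Z_1)^2 Z_1 Z_1^T)$. Hence

$$\begin{aligned}
C_{aa} &= \mathbb{E}\Big(\sum_{ijkl} \Pi_{ij} \Pi_{kl} Z_i Z_j Z_k Z_l Z_a^2\Big) \\
&= 15 \Pi_{aa}^2 + 6 \Pi_{aa} \Big(\sum_{t \neq a} \Pi_{tt}\Big) + 12 \sum_{t \neq a} \Pi_{ta}^2 + 3 \sum_{t \neq a} \Pi_{tt}^2 + 2 \sum_{s \neq t \neq a} \Pi_{ss} \Pi_{tt} + 4 \sum_{s \neq t \neq a} \Pi_{st}^2,
\end{aligned}$$

where sums over $s \neq t \neq a$ run over unordered pairs $\{s, t\}$. Let us simplify this expression. Notice the following identities:

$$\begin{aligned}
2 \mathrm{Tr}(\Pi^2) &= 2 \Pi_{aa}^2 + 4 \sum_{t \neq a} \Pi_{ta}^2 + 2 \sum_{t \neq a} \Pi_{tt}^2 + 4 \sum_{s \neq t \neq a} \Pi_{st}^2, \\
\mathrm{Tr}^2(\Pi) &= \Pi_{aa}^2 + \sum_{t \neq a} \Pi_{tt}^2 + 2 \sum_{t \neq a} \Pi_{tt} \Pi_{aa} + 2 \sum_{s \neq t \neq a} \Pi_{ss} \Pi_{tt}, \\
8 (\Pi^2)_{aa} &= 8 \Pi_{aa}^2 + 8 \sum_{t \neq a} \Pi_{ta}^2, \\
4 \mathrm{Tr}(\Pi) \Pi_{aa} &= 4 \Pi_{aa}^2 + 4 \sum_{t \neq a} \Pi_{tt} \Pi_{aa}.
\end{aligned}$$

Hence, we see that $C_{aa} = 8 (\Pi^2)_{aa} + 4 \mathrm{Tr}(\Pi) \Pi_{aa} + 2 \mathrm{Tr}(\Pi^2) + \mathrm{Tr}^2(\Pi)$. Similarly,

$$\begin{aligned}
C_{ab} &= \mathbb{E}\Big(\sum_{ijkl} \Pi_{ij} \Pi_{kl} Z_i Z_j Z_k Z_l Z_a Z_b\Big) \\
&= 8 \sum_{t \neq a \neq b} \Pi_{at} \Pi_{bt} + 4 \sum_{t \neq a \neq b} \Pi_{ab} \Pi_{tt} + 12 \Pi_{aa} \Pi_{ab} + 12 \Pi_{bb} \Pi_{ab} \\
&= 4 \Pi_{ab} \mathrm{Tr}(\Pi) + 8 (\Pi^2)_{ab}.
\end{aligned}$$

Hence $C = 8 \Pi^2 + 4 \mathrm{Tr}(\Pi) \Pi + (2 \mathrm{Tr}(\Pi^2) + \mathrm{Tr}^2(\Pi)) I$, and therefore

$$\mathbb{E}[X X^T X X^T X X^T] = 8 \Sigma^3 + 4 \mathrm{Tr}(\Sigma) \Sigma^2 + 2 \mathrm{Tr}(\Sigma^2) \Sigma + \mathrm{Tr}^2(\Sigma) \Sigma \tag{37}$$

and

$$\mathrm{Var}(X^T X X^T \delta) \asymp \delta^T \Sigma^3 \delta + \mathrm{Tr}(\Sigma) \delta^T \Sigma^2 \delta + \mathrm{Tr}^2(\Sigma) \delta^T \Sigma \delta \asymp \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta.$$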
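The fourth- and sixth-moment identities above, including Eq. (37), can be verified numerically. The following sketch assumes Gaussian $Z_1$ and a small dimension, and uses tensor-product Gauss-Hermite quadrature, which is exact for the polynomial integrands involved; the function names are hypothetical.

```python
import itertools
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def gauss_expect(f, dim, n_nodes=4):
    """E[f(Z)], Z ~ N(0, I_dim); exact for per-coordinate degree <= 2*n_nodes - 1."""
    x, w = hermegauss(n_nodes)        # nodes/weights for weight exp(-x^2 / 2)
    w = w / np.sqrt(2.0 * np.pi)      # normalise to the N(0, 1) density
    total = 0.0
    for idx in itertools.product(range(n_nodes), repeat=dim):
        idx = list(idx)
        total = total + np.prod(w[idx]) * f(x[idx])
    return total


def fourth_moment_matrix(Gamma):
    """E[X X^T X X^T] = E[(X^T X) X X^T] for X = Gamma Z."""
    def f(z):
        X = Gamma @ z
        return (X @ X) * np.outer(X, X)
    return gauss_expect(f, Gamma.shape[1])


def sixth_moment_matrix(Gamma):
    """E[X X^T X X^T X X^T] = E[(X^T X)^2 X X^T] for X = Gamma Z."""
    def f(z):
        X = Gamma @ z
        return (X @ X) ** 2 * np.outer(X, X)
    return gauss_expect(f, Gamma.shape[1])


def fourth_moment_rhs(Sigma):
    """2 Sigma^2 + Tr(Sigma) Sigma."""
    return 2 * Sigma @ Sigma + np.trace(Sigma) * Sigma


def eq37_rhs(Sigma):
    """8 Sigma^3 + 4 Tr(Sigma) Sigma^2 + (2 Tr(Sigma^2) + Tr^2(Sigma)) Sigma."""
    S2 = Sigma @ Sigma
    tr1, tr2 = np.trace(Sigma), np.trace(S2)
    return 8 * S2 @ Sigma + 4 * tr1 * S2 + (2 * tr2 + tr1 ** 2) * Sigma


rng = np.random.default_rng(0)
Gamma = rng.standard_normal((3, 3))
Sigma = Gamma @ Gamma.T
print(np.max(np.abs(fourth_moment_matrix(Gamma) - fourth_moment_rhs(Sigma))))
print(np.max(np.abs(sixth_moment_matrix(Gamma) - eq37_rhs(Sigma))))
```

Both printed discrepancies should be at the level of floating-point rounding. Dimension 3 is the smallest case in which all index patterns of $C_{aa}$ and $C_{ab}$ (including $s \neq t \neq a$) actually occur.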

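Next, let us calculate $\mathrm{Var}(Y^T Y Y^T \delta)$. For clarity, we keep only the higher order terms in the following expansions, to avoid the tediousness of the preceding propositions.

$$\begin{aligned}
\mathbb{E}[Y Y^T] &= \Sigma + \delta\delta^T, \\
\mathbb{E}(Y^T Y Y^T \delta) &= \mathbb{E}[(\Gamma Z_2 + \delta)^T (\Gamma Z_2 + \delta)(Z_2^T \Gamma^T \delta + \delta^T \delta)] \\
&= \|\delta\|^2 (\mathrm{Tr}(\Sigma) + \delta^T \delta) + 2 \delta^T \Sigma \delta \\
&\asymp \|\delta\|^2 \mathrm{Tr}(\Sigma), \\
\mathbb{E}[Y Y^T Y Y^T] &= \mathbb{E}[(\Gamma Z_2 + \delta)(\Gamma Z_2 + \delta)^T (\Gamma Z_2 + \delta)(\Gamma Z_2 + \delta)^T] \\
&\asymp \Gamma\, \mathbb{E}[Z_2 Z_2^T \Pi Z_2 Z_2^T]\, \Gamma^T + \delta\delta^T (\Sigma + \delta\delta^T) + \delta (\mathrm{Tr}(\Sigma) + \delta^T \delta) \delta^T + (\Sigma + \delta\delta^T) \delta\delta^T + \|\delta\|^2 (\Sigma + \delta\delta^T) \\
&\quad + \mathbb{E}[\delta Z_2^T \Gamma^T \delta Z_2^T \Gamma^T] + \mathbb{E}[\Gamma Z_2 \delta^T \Gamma Z_2 \delta^T] + \|\delta\|^2 \delta\delta^T, \\
\mathbb{E}[\delta^T Y Y^T Y Y^T \delta] &= 2 \delta^T \Sigma^2 \delta + \mathrm{Tr}(\Sigma) \delta^T \Sigma \delta + 5 \|\delta\|^2 \delta^T \Sigma \delta + \|\delta\|^6 + \|\delta\|^4 \mathrm{Tr}(\Sigma) \\
&\asymp \delta^T \Sigma \delta\, \mathrm{Tr}(\Sigma) + \|\delta\|^4 \mathrm{Tr}(\Sigma), \\
\mathbb{E}[\delta^T Y Y^T Y Y^T Y Y^T \delta] &= \delta^T \mathbb{E}[(\Gamma Z_2 + \delta)(\Gamma Z_2 + \delta)^T (\Gamma Z_2 + \delta)(\Gamma Z_2 + \delta)^T (\Gamma Z_2 + \delta)(\Gamma Z_2 + \delta)^T] \delta \\
&\asymp \|\delta\|^2 (\mathbb{E}[\delta^T Y Y^T Y Y^T \delta]) + \delta^T \mathbb{E}[\Gamma Z_2 Z_2^T \Gamma^T Y Y^T Y Y^T] \delta \\
&\quad + \mathbb{E}[\delta^T \Gamma Z_2 \delta^T Y Y^T Y Y^T] \delta + \|\delta\|^2 \mathbb{E}[Z_2^T \Gamma^T Y Y^T Y Y^T] \delta \\
&=: G_1 + G_2 + G_3 + G_4.
\end{aligned}$$

Define $\Phi := \Gamma^T \delta\delta^T \Gamma$, and let us expand the 4 terms above.

$$\begin{aligned}
G_2 &= \delta^T \mathbb{E}[\Gamma Z_2 Z_2^T \Gamma^T Y Y^T Y Y^T] \delta \\
&= \delta^T \mathbb{E}[X X^T X X^T X X^T] \delta + \|\delta\|^2 \delta^T \mathbb{E}[X X^T X X^T] \delta + 3 \|\delta\|^2 \mathbb{E}[Z_2^T \Phi Z_2 Z_2^T \Pi Z_2] \\
&\quad + 2 \mathbb{E}[(Z_2^T \Phi Z_2)^2] + \mathbb{E}[Z_2^T \Phi Z_2] \|\delta\|^4 \\
&\asymp \delta^T \Sigma^3 \delta + \mathrm{Tr}(\Sigma) \delta^T \Sigma^2 \delta + \mathrm{Tr}^2(\Sigma) \delta^T \Sigma \delta + \|\delta\|^2 \delta^T \Sigma^2 \delta + \|\delta\|^2 \delta^T \Sigma \delta\, \mathrm{Tr}(\Sigma) \\
&\quad + (\delta^T \Sigma \delta)^2 + \delta^T \Sigma \delta \|\delta\|^4 \\
&\asymp \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta, \\
G_1 &= \|\delta\|^2 (\mathbb{E}[\delta^T Y Y^T Y Y^T \delta]) \asymp \|\delta\|^6 \mathrm{Tr}(\Sigma) + \|\delta\|^2 \delta^T \Sigma \delta\, \mathrm{Tr}(\Sigma) \\
&\lesssim G_2, \\
G_3 &= \mathbb{E}[\delta^T \Gamma Z_2 \delta^T Y Y^T Y Y^T] \delta = 2 \mathbb{E}[Z_2^T \Phi Z_2 Z_2^T \Pi Z_2] \|\delta\|^2 + 2 \mathbb{E}[(Z_2^T \Phi Z_2)^2] + 4 \mathbb{E}[Z_2^T \Phi Z_2] \|\delta\|^4 \\
&\asymp \|\delta\|^2 \delta^T \Sigma \delta\, \mathrm{Tr}(\Sigma) + \|\delta\|^2 \delta^T \Sigma^2 \delta + \delta^T \Sigma \delta \|\delta\|^4 \\
&\lesssim G_2, \\
G_4 &= \|\delta\|^2 \mathbb{E}[Z_2^T \Gamma^T Y Y^T Y Y^T] \delta = \|\delta\|^4 \mathbb{E}[(Z_2^T \Pi Z_2)^2] + 3 \|\delta\|^2 \mathbb{E}[Z_2^T \Pi Z_2 Z_2^T \Phi Z_2] \\
&\quad + \|\delta\|^6 \mathbb{E}[Z_2^T \Pi Z_2] + 3 \|\delta\|^4 \mathbb{E}[Z_2^T \Phi Z_2] \\
&\asymp \|\delta\|^4 \mathrm{Tr}^2(\Sigma) + \|\delta\|^2 \delta^T \Sigma \delta\, \mathrm{Tr}(\Sigma) + \|\delta\|^2 \delta^T \Sigma^2 \delta + \|\delta\|^6 \mathrm{Tr}(\Sigma) + \|\delta\|^4 \delta^T \Sigma \delta.
\end{aligned}$$

Hence

$$\begin{aligned}
\mathrm{Var}(Y^T Y Y^T \delta) &= \mathbb{E}[\delta^T Y Y^T Y Y^T Y Y^T \delta] - \mathbb{E}^2[Y^T Y Y^T \delta] \\
&\asymp G_1 + G_2 + G_3 + G_4 - \|\delta\|^4 \mathrm{Tr}^2(\Sigma) \\
&\asymp \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta \\
&\asymp \mathrm{Var}(X^T X X^T \delta).
\end{aligned}$$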



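Lemma 1.

$$\mathrm{Var}(\mathbb{E}[h_4 \mid (X, Y)]) \asymp \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta.$$

Proof. Returning to Eq. (36), the 4 different variance terms involved in $\mathrm{Var}(\mathbb{E}[h_4 \mid (X, Y)])$ are

$$\begin{aligned}
\mathrm{Var}(Y^T \delta\delta^T Y) &= \mathrm{Var}((\Gamma Z_2 + \delta)^T \delta\delta^T (\Gamma Z_2 + \delta)) \asymp (\delta^T \Sigma \delta)^2 + \|\delta\|^4 \delta^T \Sigma \delta, \\
\mathrm{Var}(Y^T Y \|\delta\|^2) &= \|\delta\|^4\, \mathrm{Var}(Y^T Y) \lesssim \|\delta\|^4 (\mathrm{Tr}(\Sigma^2) + \delta^T \Sigma \delta), \\
\mathrm{Var}(\mathrm{Tr}(\Pi) \delta^T \Gamma (Z_2 - Z_1)) &\asymp \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta, \\
\mathrm{Var}(Y^T Y Y^T \delta) &\asymp \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta.
\end{aligned}$$

Under our assumptions, one can verify that the dominant term of $\mathrm{Var}(\mathbb{E}[h_4 \mid (X, Y)])$ is $\asymp \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta$.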



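Lemma 2.

$$\mathrm{Var}(h_4) \asymp \mathrm{Tr}^2(\Sigma) \mathrm{Tr}(\Sigma^2).$$

Proof.

$$\begin{aligned}
h_4 &= 4 [(X^T X')^2 + (Y^T Y')^2 - (X^T Y')^2 - (X'^T Y)^2] \\
&\quad + 2 [X^T X (X'^T X' - Y'^T Y') + Y^T Y (Y'^T Y' - X'^T X')] \\
&\quad + 4 [Y'^T Y' Y'^T (X - Y) + X'^T X' X'^T (Y - X) + X^T X X^T (Y' - X') + Y^T Y Y^T (X' - Y')].
\end{aligned} \tag{38}$$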
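Eq. (38) is a purely algebraic identity, so it can be checked on arbitrary vectors. The sketch below compares the definition of $h_4$ with the expanded form; the function names are hypothetical and the random inputs are only for illustration.

```python
import numpy as np


def h4_direct(X, Xp, Y, Yp):
    """h_4 from its definition: ||X-X'||^4 + ||Y-Y'||^4 - ||X-Y'||^4 - ||X'-Y||^4."""
    n4 = lambda v: (v @ v) ** 2
    return n4(X - Xp) + n4(Y - Yp) - n4(X - Yp) - n4(Xp - Y)


def h4_expanded(X, Xp, Y, Yp):
    """The three bracketed groups of Eq. (38)."""
    t1 = 4 * ((X @ Xp) ** 2 + (Y @ Yp) ** 2 - (X @ Yp) ** 2 - (Xp @ Y) ** 2)
    t2 = 2 * ((X @ X) * (Xp @ Xp - Yp @ Yp) + (Y @ Y) * (Yp @ Yp - Xp @ Xp))
    t3 = 4 * ((Yp @ Yp) * (Yp @ (X - Y)) + (Xp @ Xp) * (Xp @ (Y - X))
              + (X @ X) * (X @ (Yp - Xp)) + (Y @ Y) * (Y @ (Xp - Yp)))
    return t1 + t2 + t3


rng = np.random.default_rng(0)
X, Xp, Y, Yp = rng.standard_normal((4, 6))
print(h4_direct(X, Xp, Y, Yp), h4_expanded(X, Xp, Y, Yp))
```

The pure fourth-power terms such as $(X^T X)^2$ cancel between the four norms, which is why only cross terms survive in Eq. (38).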


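For example, let us calculate $\mathrm{Var}((X^T X')^2)$. Defining $S' = X' X'^T$, and ignoring constants, we have

$$\begin{aligned}
\mathbb{E}[(X^T X')^4] &= \mathbb{E}_{X'} \mathbb{E}_X[(X^T S' X)^2] = \mathbb{E}_{X'} \mathbb{E}_{Z_1}[(Z_1^T \Gamma^T S' \Gamma Z_1)^2] \\
&\asymp \mathbb{E}_{X'}[\mathrm{Tr}(\Gamma^T X' X'^T \Gamma \Gamma^T X' X'^T \Gamma) + \mathrm{Tr}^2(\Gamma^T X' X'^T \Gamma)] \\
&= \mathbb{E}_{X'}[(X'^T \Sigma X')^2 + (X'^T \Sigma X')^2] \\
&= \mathbb{E}_{Z_1'}[(Z_1'^T \Pi^2 Z_1')^2 + (Z_1'^T \Pi^2 Z_1')^2] \\
&\asymp 2 \mathrm{Tr}(\Pi^4) + \mathrm{Tr}^2(\Pi^2), \\
\mathbb{E}[(X^T X')^2] &= \mathbb{E}_{X'} \mathbb{E}_X[Z_1^T \Gamma^T S' \Gamma Z_1] = \mathbb{E}_{X'} \mathrm{Tr}(\Gamma^T X' X'^T \Gamma) = \mathbb{E}_{Z_1'}[Z_1'^T \Pi^2 Z_1'] \\
&= \mathrm{Tr}(\Pi^2), \\
\mathrm{Var}((X^T X')^2) &= \mathbb{E}[(X^T X')^4] - \mathbb{E}[(X^T X')^2]^2 \asymp \mathrm{Tr}(\Pi^4) = \mathrm{Tr}(\Sigma^4) = o(\mathrm{Tr}^2(\Sigma) \mathrm{Tr}(\Sigma^2)).
\end{aligned}$$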


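Similarly, let us calculate $\mathrm{Var}(X^T X\, X'^T X')$ and $\mathrm{Var}(Y^T Y\, Y'^T Y')$ as follows.

$$\begin{aligned}
\mathrm{Var}(X^T X\, X'^T X') &= \mathbb{E}[(X^T X)^2 (X'^T X')^2] - \mathbb{E}^2[X^T X\, X'^T X'] \\
&= \mathbb{E}^2[(X^T X)^2] - \mathbb{E}^4[X^T X] \asymp (8 \mathrm{Tr}(\Sigma^2) + 4 \mathrm{Tr}^2(\Sigma))^2 - (2 \mathrm{Tr}(\Sigma))^4 \\
&\asymp \mathrm{Tr}(\Sigma^2) \mathrm{Tr}^2(\Sigma), \\
\mathrm{Var}(Y^T Y\, Y'^T Y') &= \mathbb{E}^2[(Y^T Y)^2] - \mathbb{E}^4(Y^T Y) \\
&\asymp \big(\mathrm{Tr}^2(\Sigma) + 2 \mathrm{Tr}(\Sigma^2) + 4 \delta^T \Sigma \delta + (\delta^T \delta)^2 + 4 \lambda_3\, \mathrm{diag}(\Pi)^T \Gamma^T \delta + 2 \delta^T \delta\, \mathrm{Tr}(\Sigma)\big)^2 - (\mathrm{Tr}(\Sigma) + \delta^T \delta)^4 \\
&\asymp \mathrm{Tr}^2(\Sigma) \mathrm{Tr}(\Sigma^2),
\end{aligned}$$

where we use Proposition 8 (with $a = 2 \Gamma^T \delta$ and $b = \delta^T \delta$), and the last step follows by larger terms canceling after direct expansion.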

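Next, let us bound $\mathrm{Var}(X^T X X^T X')$ and $\mathrm{Var}(Y^T Y Y^T Y')$ as follows (other terms are similar). Multiplying Eq. (37) by $\Sigma$, we see that

$$\mathbb{E}[X X^T X X^T X X^T \Sigma] = 8 \Sigma^4 + 4 \mathrm{Tr}(\Sigma) \Sigma^3 + 2 \mathrm{Tr}(\Sigma^2) \Sigma^2 + \mathrm{Tr}^2(\Sigma) \Sigma^2.$$

Now taking traces on both sides, and applying trace rotation to the left, we see that the dominant term is

$$\mathrm{Tr}(\mathbb{E}[X X^T X X^T X X^T \Sigma]) = \mathbb{E}[\mathrm{Tr}(X^T X X^T X X^T \Sigma X)] = \mathbb{E}[(X^T X)^2 X^T \Sigma X] \asymp \mathrm{Tr}(\Sigma^2) \mathrm{Tr}^2(\Sigma).$$

Since $\mathrm{Var}(P) \le \mathbb{E}[P^2]$, we conclude that

$$\mathrm{Var}(X^T X X^T X') \le \mathbb{E}[X^T X X^T (X' X'^T) X X^T X] = \mathbb{E}[X^T \Sigma X (X^T X)^2] \asymp \mathrm{Tr}^2(\Sigma) \mathrm{Tr}(\Sigma^2).$$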


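Then, taking expectations with respect to $Y'$ first, we get

$$\begin{aligned}
\mathrm{Var}(Y^T Y Y^T Y') &= \mathbb{E}[Y^T (\Sigma + \delta\delta^T) Y\, Y^T Y\, Y^T Y] - \mathbb{E}^2[Y^T Y Y^T \delta] \\
&= \mathbb{E}[Y^T \Sigma Y (Y^T Y)^2] + \mathrm{Var}(Y^T Y Y^T \delta) \\
&\asymp \mathbb{E}[Z_Y^T \Pi^2 Z_Y (Z_Y^T \Pi Z_Y)^2] + \mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta
\end{aligned}$$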
X {S^YSfS'^Y^S + 4{6^Y^6f + 8{S'^YS){6^Y^S) + 86^Y^6 
+ 4Tr{Y^)[S^Y^S + -p 8Tr{Y)[d^Y^S + {S^Y^6){S^YS)] 

+ 8Tr{Y^)S^Y'^6 + 6Tr{Y)S'^Y^S + Tr'^{Y)Tr{Y^) 

+ 4Tr{Y^)Tr{Y) + 2Tr^{Y^) + 8Tr(E^) 

X Tr‘^{Y)Tr{Y‘^). 


The above results are obtained in a fashion similar to Proposition 8 for the variance of uncentered quadratic forms, or Proposition 9 for $\mathrm{Var}(Y^T Y Y^T \delta)$, or from the results of Bao and Ullah [2010] about moments of products of non-normal quadratic forms (Pg. 255 of Ullah [2004] for the Gaussian case). Hence, bounding $\mathrm{Var}(h_4)$ by (a constant times) the sum of variances of the terms in the expansion Eq. (38), we see that

$$\mathrm{Var}(h_4) \asymp \mathrm{Tr}^2(\Sigma) \mathrm{Tr}(\Sigma^2)$$

as required, concluding the proof of the lemma.



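In summary, using Eq. (35), we have the variance of $U_4$ as

$$\mathrm{Var}(U_4) \le C_1 \frac{\mathrm{Tr}^2(\Sigma) \mathrm{Tr}(\Sigma^2)}{n^2} + C_2 \frac{\mathrm{Tr}^2(\Sigma)\, \delta^T \Sigma \delta}{n} \le C\, \mathrm{Tr}^2(\Sigma)\, \mathrm{Var}(U_{CQ})$$

for some absolute constants $C_1$, $C_2$, $C = \max\{C_1, C_2\}$. Since $\gamma^2 = \omega(\mathrm{Tr}(\Sigma))$, we see that

$$\mathrm{Var}(U_4 / \gamma^4) = o(\mathrm{Var}(U_{CQ} / \gamma^2))$$

as required for step (iii).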
Remark 15. Recall that it is typically stated in textbooks like Serfling [2009] that for degenerate U-statistics, the variance under the null is $O(1/n^2)$, and the variance under the alternative is $O(1/n)$. While this is true asymptotically when $n \to \infty$ in the fixed $d$ setting, the variance under the alternative can still be $O(1/n^2)$ in the high-dimensional setting, depending on the signal to noise ratio and dimension when $d, n \to \infty$.

The conclusion of step (iii) also concludes the proof of the theorem.


8.1 Proof of Theorem 3

The only difference from the above proof is that instead of taking the Taylor expansion of the Gaussian kernel, we take the expansion of the (modified) Euclidean distance. This gives rise to the exact same set of terms to bound, with different constants. Indeed, when $\gamma^2 = \omega(\mathrm{Tr}(\Sigma))$, by the exact form of Taylor's theorem for $f(\cdot) = (1 + \cdot)^{1/2}$ at $a$ around $\tau = o(1)$,

$$f(a) = f(\tau) + \frac{a - \tau}{2 (1 + \tau)^{1/2}} - \frac{(a - \tau)^2}{8 (1 + \tau)^{3/2}} + \frac{3 (a - \tau)^3}{48} (1 + \zeta)^{-5/2} \tag{39}$$

for some $\zeta$ between $a$ and $\tau$. Comparing Eq. (39) with Eq. (31), we see that all the terms are exactly the same, except for constants. Hence, exactly the same proof as above goes through for Theorem 3 as well.
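The second-order expansion in Eq. (39) and the size of its Lagrange remainder can be checked numerically. In the sketch below the function names and the grid of test points are illustrative assumptions; the bound uses the fact that $(1 + \zeta)^{-5/2}$ is largest at the smaller of $a$ and $\tau$.

```python
import numpy as np


def sqrt_taylor2(a, tau):
    """Second-order Taylor polynomial of f(x) = (1 + x)^(1/2) around tau."""
    d = a - tau
    return np.sqrt(1 + tau) + d / (2 * np.sqrt(1 + tau)) - d ** 2 / (8 * (1 + tau) ** 1.5)


def remainder_bound(a, tau):
    """Upper bound on |3 (a - tau)^3 / 48 * (1 + zeta)^(-5/2)| for zeta between a and tau."""
    lo = min(a, tau)                  # (1 + zeta)^(-5/2) is decreasing in zeta
    return 3 * abs(a - tau) ** 3 / (48 * (1 + lo) ** 2.5)


for tau in (0.0, 0.1, 0.5):
    for a in np.linspace(0.0, 2.0, 5):
        rem = np.sqrt(1 + a) - sqrt_taylor2(a, tau)
        print(f"tau={tau:.1f} a={a:.1f} remainder={rem:+.3e} bound={remainder_bound(a, tau):.3e}")
```

The remainder has the sign of $(a - \tau)^3$ and is cubic in $a - \tau$, so when $a - \tau$ is small the third-order term is of lower order than the first two, mirroring the structure of the Gaussian kernel expansion.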


Acknowledgments 

This project was supported by the grant NSF IIS-1247658.


References 

Niall H Anderson, Peter Hall, and D Michael Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50(1):41-54, 1994.

Theodore W Anderson. An introduction to multivariate statistical analysis. 1958. 

Theodore W Anderson and Donald A Darling. Asymptotic theory of certain goodness of fit criteria based on stochastic processes. The Annals of Mathematical Statistics, pages 193-212, 1952.

Zhidong D Bai and Hewa Saranadasa. Effect of high dimension: by an example of a two sample problem. Statistica Sinica, 6(2):311-329, 1996.

Yong Bao and Aman Ullah. Expectation of quadratic forms in normal and nonnormal variables with applications. Journal of Statistical Planning and Inference, 140(5):1193-1205, 2010.

L Baringhaus and C Franz. On a new multivariate two-sample test. Journal of Multivariate Analysis, 88(1):190-206, 2004.

Alexandre Belloni and Gustavo Didier. On the Behrens-Fisher problem: a globally convergent algorithm and a finite-sample study of the Wald, LR and LM tests. The Annals of Statistics, pages 2377-2408, 2008.

Peter J Bickel. A distribution free version of the Smirnov two sample test in the p-variate case. The Annals of Mathematical Statistics, pages 1-23, 1969.

Tony Cai, Weidong Liu, and Yin Xia. Two-sample test of high dimensional means under dependence. Journal 
of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):349-372, 2014. 



Song Xi Chen and Ying-Li Qin. A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics, 38(2):808-835, apr 2010. doi: 10.1214/09-aos716. URL http://dx.doi.org/10.1214/09-aos716

Harald Cramer. On the composition of elementary errors: First paper: Mathematical deductions. Scandinavian Actuarial Journal, 1928(1):13-74, 1928.

Arthur P Dempster. A high dimensional two sample significance test. The Annals of Mathematical Statistics, 
pages 995-1010, 1958. 

Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 38(1): 1-50, 2010. 

V Alba Fernandez, MD Jimenez Gamero, and J Munoz Garcia. A test for the two-sample problem based on 
empirical characteristic functions. Computational statistics & data analysis, 52(7):3730-3748, 2008. 

Jerome H Friedman and Lawrence C Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pages 697-717, 1979.

A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. Journal of 
Machine Learning Research, 13:723-773, 2012a. 

A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and K. Fukumizu. 
Optimal kernel choice for large-scale two-sample tests. Neural Information Processing Systems, 2012b. 

Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Scholkopf, and Alex J Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513-520, 2006.

Peter Hall and Christopher C Heyde. Martingale limit theory and its application. Academic press, 2014. 

Norbert Henze. A multivariate two-sample test based on the number of nearest neighbor type coincidences. 
The Annals of Statistics, pages 772-783, 1988. 

Roger A Horn and Charles R Johnson. Topics in matrix analysis. Cambridge Univ. Press Cambridge etc, 
1991. 

Harold Hotelling. The generalization of Student's ratio. Annals of Mathematical Statistics, 2(3):360-378, aug 1931. doi: 10.1214/aoms/1177732979. URL http://dx.doi.org/10.1214/aoms/1177732979

Yuri Ingster and Irina A Suslina. Nonparametric goodness-of-fit testing under Gaussian models, volume 169. 
Springer Science & Business Media, 2003. 

Takeaki Kariya. A robustness property of hotelling’s t2-test. The Annals of Statistics, pages 211-214, 1981. 

Maurice Kendall and Alan Stuart. The advanced theory of statistics, vol. 1: Distribution theory. London: 
Griffin, 1977, 4th ed., 1, 1977. 

Andrej N Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. na, 1933. 

Erich Leo Lehmann and Howard JM D'Abrera. Nonparametrics: statistical methods based on ranks. Springer New York, 2006.

Miles Lopes, Laurent Jacob, and Martin J Wainwright. A more powerful two-sample test in high dimensions using random projection. In Advances in Neural Information Processing Systems, pages 1206-1214, 2011.

R. Lyons. Distance covariance in metric spaces. Annals of Probability, 41(5):3284-3305, 2013. 

Jan R Magnus. The expectation of products of quadratic forms in normal variables: the practice. Statistica 
Neerlandica, 33(3):131-136, 1979. 



Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, and Larry Wasserman. On the decreasing 
power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Proceedings of 
the 29th AAAI Conference on Artificial Intelligence (AAAI 2015), 2015. 

Sashank J. Reddi, Aaditya Ramdas, Barnabas Poczos, Aarti Singh, and Larry Wasserman. On the high 
dimensional power of a linear-time two sample test under mean-shift alternatives. In Proceedings of the 
18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015), 2015. 

Paul R Rosenbaum. An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(4):515-530, 2005.

W. Rudin. Fourier analysis on groups. Interscience Publishers, New York, 1962. 

O.V. Salaevskii. Minimax character of Hotelling's T2 test. I. In Investigations in Classical Problems of Probability Theory and Mathematical Statistics, pages 74-101. Springer, 1971.

Mark F Schilling. Multivariate two-sample tests based on nearest neighbors. Journal of the American Statistical Association, 81(395):799-806, 1986.

Bernhard Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. 

D. Sejdinovic, B. Sriperumbudur, A. Gretton, K. Fukumizu, et al. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263-2291, 2013.

Robert J Serfling. Approximation theorems of mathematical statistics, volume 162. John Wiley & Sons, 
2009. 

JB Simaika. On an optimum property of two important statistical tests. Biometrika, pages 70-80, 1941. 

Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, pages 279-281, 1948.

Muni S. Srivastava and Meng Du. A test for the mean vector with fewer observations than the dimension. Journal of Multivariate Analysis, 99(3):386-402, mar 2008. doi: 10.1016/j.jmva.2006.11.002. URL http://dx.doi.org/10.1016/j.jmva.2006.11.002

Muni S Srivastava, Shota Katayama, and Yutaka Kano. A two sample test in high dimensional data. Journal 
of Multivariate Analysis, 114:349-358, 2013. 

Gabor J Szekely and Maria L Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004. 

Aman Ullah. Finite sample econometrics. Oxford University Press Oxford, 2004. 

Richard Von Mises. Wahrscheinlichkeit statistik und wahrheit. 1928. 

Abraham Wald and Jacob Wolfowitz. On a test whether two samples are from the same population. The 
Annals of Mathematical Statistics, 11(2):147-162, 1940. 

Wojciech Zaremba, Arthur Gretton, and Matthew Blaschko. B-test: A non-parametric, low variance kernel 
two-sample test. In Advances in Neural Information Processing Systems, pages 755-763, 2013. 



A An error in Chen and Qin [2010]: the power for high SNR


We briefly describe an error in Chen and Qin [2010] that has a few important repercussions. All notations, equation numbers and theorems in this paragraph refer to those in Chen and Qin [2010]. Using the test statistic $T_n / \hat{\sigma}_{n1}$ defined below Theorem 2 in Chen and Qin [2010], we can derive the power under their assumption (3.5) as

$$\begin{aligned}
P_1\Big(\frac{T_n}{\hat{\sigma}_{n1}} > \xi_\alpha\Big) &= P_1\Big(\frac{T_n - \|\mu_1 - \mu_2\|^2}{\sigma_{n2}} > \frac{\sigma_{n1}}{\sigma_{n2}} \xi_\alpha - \frac{\|\mu_1 - \mu_2\|^2}{\sigma_{n2}}\Big) \\
&\approx \Phi\Big(\frac{\|\mu_1 - \mu_2\|^2}{\sigma_{n2}}\Big) \quad \text{(the denominator is not } \sigma_{n1}\text{)} \\
&= \Phi\Big(\frac{\sqrt{n}\, \|\mu_1 - \mu_2\|^2}{\sqrt{n}\, \sigma_{n2}}\Big),
\end{aligned}$$

which should be the expression for power that they derive in Eq. (3.12), the most important difference being the presence of $\sqrt{n}$ instead of $n$ in the numerator.

